-
Notifications
You must be signed in to change notification settings - Fork 339
SPL:Reading and Writing XML Data
XML is a commonly used data format. The format features multilevel nodes. In each node, you can define multiple pairs of attribute values. The structure is more complicated than that of a two-dimensional database table. There are difficulties if we are trying to compute XML directly. SPL (Structured Process Language) provides xml() function to parse and generate XML. It is convenient to arrange and compute data using it.
XML does not define explicit data types, so we treat both tag values and attribute values as strings. In the following person.xml, for instance, root is the root node, under which each person node represents a person, and under person, each node records a piece of information about the user:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<person>
<name>Ann</name>
<gender>female</gender>
<birthday>1992-12-03</birthday>
<birthday2>12/03/1992</birthday2>
<age>"29"</age>
<grade>8</grade>
<isForeigner>true</isForeigner>
</person>
</root>
We read in the XML file in SPL:
A | |
---|---|
1 | D:/dataFiles/xml/person.xml |
2 | =file(A1).read() |
3 | =xml(A2) |
A2 reads in the whole XML string from the original file:
Basic SPL data types include string, numeric, boolean (true and false), null and datetime. SPL will auto-match the proper data type when parsing XML, and continue to use the string type when no corresponding type can be found. In the above code, A3 uses SPL xml() function to parse XML by expanding the multiple levels of data. Data in the third level is parsed as a record. name and gender are identified as strings; birthday is identified as date type since it conforms to the default date format (yyyy-MM-dd); birthday2 remains the string type as no corresponding type can be matched for it; as a numeric value, age is forced to be identified as string type because it is enclosed by double quotes; grade is identified as numeric type; and isForeigner is identified as boolean.
When there are multiple person nodes of same structure under root node, the XML is similar to a multirow table recording information of people:
<root>
<person>
<name>Jim</name>
<gender>male</gender>
<birthday>1985-03-15</birthday>
</person>
<person>
<name>Ann</name>
<gender>female</gender>
<birthday>1992-12-03</birthday>
</person>
……
</root>
We parse the XML using SPL xml() function. This time data in the third level is a multirow two-dimensional table:
A | |
---|---|
3 | =xml(A2,"root/person") |
The second parameter in the xml() function specifies the parsing path root/person, which means getting the third level table sequence directly:
<root>
<person>
<name>Jim</name>
<gender>male</gender>
</person>
<person>
<name>Ann</name>
<birthday>1992-12-03</birthday>
</person>
</root>
The first person node does not have birthday attribute, and there is no gender attribute in the second person node. In this case, each person will still be parsed as a record but a table sequence cannot be formed. Instead, the two records of different structures will be returned as a sequence.
<root>
<person>
<name>Jim</name>
<gender>male</gender>
<birthday>1985-03-15</birthday>
</person>
<team>
<name>team1</name>
<persons>Jim,Ann</persons>
</team>
<person>
<name>Ann</name>
<gender>female</gender>
<birthday>1992-12-03</birthday>
</person>
<team>
<name>team2</name>
<persons>Ann</persons>
</team>
</root>
In the second level, besides person nodes, there are also several team nodes. In this case, data of different nodes will be parsed into corresponding table sequences. Two table sequences will be generated correctly even if person nodes and team nodes are not arranged in order.
The XML in the above does not have attribute information. Yet even if there is any, the above blocks of SPL code will ignore them. Now let’s look at how SPL parses the attribute information as needed. We modify person.xml to make it contain the attribute information:
<root >
<person name="Jim" gender="male" birthday="1985-03-15"></person>
<person name="Ann" gender="female" birthday="1992-12-03"></person>
</root>
A | |
---|---|
1 | D:/dataFiles/xml/person.xml |
2 | =file(A1).read() |
3 | =xml@s(A2) |
xml() function uses @s option to switch to another parsing style, which makes each XML tag a record and uses tag names and attributes as its structure:
In the above table sequence, person values are nulls because the corresponding tag values are empty. Now we take person IDs as tag values and get the following result:
<root >
<person name="Jim" gender="male" birthday="1985-03-15">10001</person>
<person name="Ann" gender="female" birthday="1992-12-03">10002</person>
</root>
The tag values can be empty or a basic data type, like 10001. A tag can also have sub-tags, and the corresponding values are a nested table sequence:
<root>
<person name="Jim" gender="male" birthday="1985-03-15">
<buy num="5" price="$3.2">milk</buy>
<buy num="2" price="$1.8">apple</buy>
</person>
<person name="Ann" gender="female" birthday="1992-12-03">
<buy num="10" price="$1.5">egg</buy>
</person>
</root>
The inverse operation of parsing XML is to generating XML from an SPL object. SPL also uses xml() to perform the generation. The function judges type of the passed-in parameter and auto-selects the proper conversion action (It converts XML into an SPL object when an XML string is passed in, and an SPL object into an XML string when an SPL object is passed in). We convert the above parsed XML directly back to XML and observe the differences in the two opposite operations.
Below is the original XML:
<root>
<person>
<name>Ann</name>
<gender>female</gender>
<birthday>1992-12-03</birthday>
<birthday2>12/03/1992</birthday2>
<age>"29"</age>
<grade>8</grade>
<isForeigner>true</isForeigner>
</person>
</root>
In A3, xml() function converts A2’s XML string into an SPL object. A4 generates an XML string directly form A3’s SPL object:
A | |
---|---|
1 | D:/dataFiles/xml/person.xml |
2 | =file(A1).read() |
3 | =xml(A2) |
4 | =xml(A3) |
A4 gets a result XML string that has same structure as the original XML string. All string values are enclosed by double quotes, as shown below:
<root>
<person>
<name>"Ann"</name>
<gender>"female"</gender>
<birthday>1992-12-03</birthday>
<birthday2>"12/03/1992"</birthday2>
<age>"29"</age>
<grade>8</grade>
<isForeigner>true</isForeigner>
</person>
</root>
Below is the original XML:
<root>
<person>
<name>Jim</name>
<gender>male</gender>
<birthday>1985-03-15</birthday>
</person>
<person>
<name>Ann</name>
<gender>female</gender>
<birthday>1992-12-03</birthday>
</person>
</root>
We parse the XML and write it to a target XML file. The result has the same structure as the original:
<root>
<person>
<name>"Jim"</name>
<gender>"male"</gender>
<birthday>1985-03-15</birthday>
</person>
<person>
<name>"Ann"</name>
<gender>"female"</gender>
<birthday>1992-12-03</birthday>
</person>
</root>
Below is the original XML:
<root>
<person>
<name>Jim</name>
<gender>male</gender>
<birthday>1985-03-15</birthday>
</person>
<person>
<name>Ann</name>
<gender>female</gender>
<birthday>1992-12-03</birthday>
</person>
</root>
A | |
---|---|
1 | D:/dataFiles/xml/person.xml |
2 | =file(A1).read() |
3 | =xml(A2) |
4 | =xml(A3,"persons") |
In A4, the second parameter in xml() function specifies the root path to which the XML is written:
<persons>
<root>
<person>
<name>"Jim"</name>
<gender>"male"</gender>
<birthday>1985-03-15</birthday>
</person>
<person>
<name>"Ann"</name>
<gender>"female"</gender>
<birthday>1992-12-03</birthday>
</person>
</root>
</persons>
Note that there are root nodes under the first persons node. If you want to remove it and make person node directly under persons, you can replace the output SPL object with A3’s root field value:
A | |
---|---|
4 | =xml(A3.root,"persons") |
<persons>
<person>
<name>"Jim"</name>
<gender>"male"</gender>
<birthday>1985-03-15</birthday>
</person>
<person>
<name>"Ann"</name>
<gender>"female"</gender>
<birthday>1992-12-03</birthday>
</person>
</persons>
Below is the original XML:
<root>
<person>
<name>Jim</name>
<gender>male</gender>
</person>
<person>
<name>Ann</name>
<birthday>1992-12-03</birthday>
</person>
</root>
The parsed XML is not the standard table sequence and the generation back to XML will fail.
Below is the original XML:
<root>
<person>
<name>Jim</name>
<gender>male</gender>
<birthday>1985-03-15</birthday>
</person>
<team>
<name>team1</name>
<persons>Jim,Ann</persons>
</team>
<person>
<name>Ann</name>
<gender>female</gender>
<birthday>1992-12-03</birthday>
</person>
<team>
<name>team2</name>
<persons>Ann</persons>
</team>
</root>
The result of writing the parsed XML has basically the same structure because data in the same nodes is parsed into same table sequence in order:
<root>
<person>
<name>"Jim"</name>
<gender>"male"</gender>
<birthday>1985-03-15</birthday>
</person>
<person>
<name>"Ann"</name>
<gender>"female"</gender>
<birthday>1992-12-03</birthday>
</person>
<team>
<name>"team1"</name>
<persons>"Jim,Ann"</persons>
</team>
<team>
<name>"team2"</name>
<persons>"Ann"</persons>
</team>
</root>
Below is the original XML:
<root >
<person name="Jim" gender="male" birthday="1985-03-15">10001</person>
<person name="Ann" gender="female" birthday="1992-12-03">10002</person>
</root>
The result of writing the parsed XML of the above attribute type has a very different structure. This also shows the differences between the two parsing types:
<xml>
<row>
<root>
<person>10001</person>
<name>"Jim"</name>
<gender>"male"</gender>
<birthday>1985-03-15</birthday>
</root>
<root>
<person>10002</person>
<name>"Ann"</name>
<gender>"female"</gender>
<birthday>1992-12-03</birthday>
</root>
</row>
</xml>
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code