-
Notifications
You must be signed in to change notification settings - Fork 340
From JSONPath and XPath to SPL
Both XML and JSON are structured text formats good at recording hierarchical data that carries general and rich information. They are often used to interchange data and transmit information in systems like WebService/RESTful and Microservices. The hierarchical data has much more complicated structure than the traditional two-dimensional tables, and are difficult to process after being retrieved.
In the days when there were no specialized JSON/XML processing technologies, JAVA developers usually turned to hardcoding or loaded JSON/XML data to the database for further processing. Hardcoding has weak computing ability, produces too much code and has very slow development efficiency. Loading data to the database helps to obtain database computing ability for achieving the rest part of the processing, but the approach involves many steps and long delay time, results in undesired tight coupling between the Java program and the database, and creates an inefficient framework. Moreover, databases are not good at handling hierarchical structured data though they excel at computing two-dimensional tables. This situation remained until specialized technologies appeared one by one and considerably increased the efficiency of JSON/XML data processing in Java. JSONPath and XPath are considered the best among them.
XPath is a widely used XML processing language that is encapsulated in function libraries such as XOM, Xerces-J, Jdom and Dom4J. JSONPath, now the extensively-used JSON processing language, imitates XPath syntax to achieve similar functionalities and is equipped with its own function libraries. Compared with hardcoding, XPath and JSONPath generate short code and have unprecedentedly strong computing ability.
One instance is to get XML string from WebService using arronlong HTTP function library and parse it to Document type using Dom4J function library, and perform a conditional query with Dom4J’s built-in XPath syntax:
String path= "http://.../emp_orders.xml";
String XMLStr= com.arronlong.httpclientutil.HttpClientUtil.get(com.arronlong.httpclientutil.common.HttpConfig.custom().url(path));
Document doc = DocumentHelper.parseText(XMLStr);
List<Node> list=doc.selectNodes("/xml/row/Orders[Amount>1000 and Amount<=3000 and contains(Client,'bro')]")
Similarly, we are trying to use JSONPath to perform a conditional query:
String path= "http://.../emp_orders.json";
String JsonStr= com.arronlong.httpclientutil.HttpClientUtil.get(com.arronlong.httpclientutil.common.HttpConfig.custom().url(path));
Object document = Configuration.defaultConfiguration().jsonProvider().parse(JsonStr);
ArrayList l=JsonPath.read(document, "$[*].Orders[?(@.Amount>1000 && @.Amount<2000 && @.Client =~ /.*?business.*?/i )]");
JSONPath and XPath have similar uses, are syntactically alike and possess almost equal computing abilities. In the following discussions and illustrations, we’ll mainly use JSONPath to explain my point. Both JSONPath and XPath give a relatively complete support for conditional queries. The support covers comparison operators such as greater than and greater than or equal to, logical operators such as logical AND, logical OR and logical NOT, string type regular expressions like ~ /.*?business.*?/i, and string functions such as contains for performing fuzzy matches. JSONPath and XPath also allows using mathematical operators (functions) such as +-*and div, position functions such as position and last, and date functions such as year-from-dateand timezone-from-time in conditional queries.
One thing worthy of note is that JSONPath and XPath can flexibly express the range of hierarchical levels in conditional queries, including absolute position, relative position, parent node, child node, attribute and element. This feature, as shown by $[*].Orders and /xml/row/Orders in the above code, distinguishes the hierarchical data processing languages from the two-dimensional data processing language (SQL).
JSONPath and XPath also support aggregate operations. To perform sum operation in JSONPath, for instance:
Double d=JsonPath.read(document, "$.sum($[*].Orders[*].Amount)");
Other aggregate functions they support are average, max, min and count, etc.
As can be seen from these code examples, JSONPath and XPath have intuitive and easy-to-understand syntax, achieve conditional queries with short code, and access the hierarchical data structure conveniently. They are far more convenient to use than hardcoding.
Though JSONPath and XPath have advanced computing abilities compared with Java hardcoding, the abilities fall too far short of demands in handling daily, even basic, computations. They are very much outplayed by SQL. In fact, the two languages only support two basic computations – conditional query and aggregation. To achieve other computations, they still need the turn to complicated coding.
To perform grouping & aggregation in JSONPath, for instance:
ArrayList orders=JsonPath.read(document, "$[*].Orders[*]");
Comparator<HashMap> comparator = new Comparator<HashMap>() {
public int compare(HashMap record1, HashMap record2) {
if (!record1.get("Client").equals(record2.get("Client"))) {
return ((String)record1.get("Client")).compareTo((String)record2.get("Client"));
} else {
return ((Integer)record1.get("OrderID")).compareTo((Integer)record2.get("OrderID"));
}
}
};
Collections.sort(orders, comparator);
ArrayList<HashMap> result=new ArrayList<HashMap>();
HashMap currentGroup=(HashMap)orders.get(0);
double sumValue=(double) currentGroup.get("Amount");
for(int i = 1;i < orders.size(); i ++){
HashMap thisRecord=(HashMap)orders.get(i);
if(thisRecord.get("Client").equals(currentGroup.get("Client"))){
sumValue=sumValue+(double)thisRecord.get("Amount");
}else{
HashMap newGroup=new HashMap();
newGroup.put(currentGroup.get("Client"),sumValue);
result.add(newGroup);
currentGroup=thisRecord;
sumValue=(double) currentGroup.get("Amount");
}
}
As JsonPath/XPath does not support grouping & aggregation directly, programmers have to write their own programs to accomplish the most part of the computation. This requires that programmers be able to control each detail. The code is lengthy and prone to errors. A change of grouping field or field to be aggregated needs a lot of modifications of the original code. To group or summarize data by multiple fields even involves a major overhaul. It is hard to write general code for all scenarios in JsonPath/XPath.
JSONPath and XPath do not support most of the basic computations due to their severe lack of computing abilities. Besides grouping & aggregation, other computations they do not support are renaming, sorting, distinct, join, set-oriented computations, getting Cartesian product, merge, window functions and order-based computations, etc. They also do not support the mechanism that splits a big computing task into multiple basic computations, such as subquery and stepwise computation, making it hard to implement complex computing logics.
Lacking the basic HTTP interface is another problem. Programmers have to write their own code or use a third-party function library like JourWon and Arronlong (the previous code uses Arronlong library) if they use the languages to handle computing tasks. At other times when MongoDB and ElasticSearch, which have their own interface protocols respectively, return hierarchical data, they still need to write their own code or use a third-party class library because JSONPath and XPath do not provide corresponding connectors. This results in complex framework, increases instability and slows down development.
It is JSONPath and XPath’s insufficient computing abilities that leads to low development efficiency. We need a JSON/XML processing technology that has complete and enough computational capability if we want to enhance the efficiency.
SPL is such a technology.
esProc SPL is an open-source structured data/hierarchical data processing language under JVM. It has computational capability as good as SQL thanks to its built-in specialized hierarchical data object and a wealth of computing functions, string functions and date functions. With SPL, we can effectively increase the efficiency of data processing after retrieving data from WebService/RESTful.
SPL encapsulates the specialized hierarchical data object to offer solid low-level support.
To read XML string from an XML file and parse it as a SPL table sequence, for instance:
A | |
1 | =file("d:\\xml\\emp_orders.xml").read() |
2 | =xml(A1,"xml/row") |
Click cell A2 and you can see a hierarchical table sequence structure, where Eid field and State field store data of simple types and Orders field stores sets of records (two-dimensional tables). Click a row under Orders and we can expand the value to view details:
ASPL table sequence is a specialized data object for representing data of any number of hierarchies. Here’s an example:
Table sequences are so competent that they can represent two-dimensional data or hierarchical data originated from any sources, including but not limited to XML, JSON, files and WebService. To read JSON string from a file of same structure as the previous XML file and parse it as a SPL table sequence, for instance:
A | |
1 | =file("d:\\xml\\emp_orders.json").read() |
2 | =json(A1) |
This table sequence has the same structure as the previous one. We can use the same code to perform the subsequence computations. JSON will be used in our following illustrations.
With rich computing functions, SPL achieves a basic computation with one-liner code
To perform a conditional query on hierarchical JSON data, for instance:
A | |
2 | …//Skip data retrieval and parsing |
3 | =A2.conj(Orders) |
4 | =A3.select(Amount>1000 && Amount<=2000 && like@c(Client,"\*business\*")) |
The code shows that SPL has a comprehensive support for conditional queries, covering features of JSONPath and XPath from comparison operators, logical operators, regular expressions to string functions such as like. SPL also supports using mathematical operators (functions), position functions and date functions in conditional queries. The language can access any hierarchy with simple code in a flexible way, like A2.conj(Orders).
It is also simple to achieve various aggregate computations in SPL. To calculate sum for instance, SPL uses =A3.sum(Amount).
SPL supports a rich set of basic computations and have computational ability as good as SQL. To perform a grouping & aggregation, JSONPath and XPath have to use hardcoding but SPL just needs a single line of code:
=A2.conj(Orders).groups(Client;sum(Amount))
More examples:
A | B | |
1 | …. | |
3 | =A2.groups(State,Gender;avg(Salary),count(1)) | Group & aggregate by multiple fields |
45 | =A1.new(Name,Gender,Dept,Orders.OrderID,Orders.Client,Orders.Client,Orders.SellerId,Orders.Amount,Orders.OrderDate) | Association |
6 | =A1.sort(Salary) | Sort |
7 | =A1.id(State) | Distinct |
8 | =A2.top(-3;Amount) | Get topN |
9 | =A2.groups(Client;top(3,Amount)) | Get TopN in each group (window function) |
SPL offers rich date and string functions to increase development efficiency
SPL supplies a large number of date functions and string functions, far more than JSONPath/XPath and even SQL and much more powerful than them. This enables SPL to implement same computations with significantly shorter code. For example:
Time functions:
Get the date before or after a specified date: elapse("2020-02-27",5) //Return 2020-03-03
Find the day of the week of a specified date: day@w("2020-02-27") //Return 5, which is Thursday
Get the date N workdays after: workday(date("2022-01-01"),25) //Return 2022-02-04
String functions:
Check whether a string all consists of letters: isdigit("12345") //Return true
Get a string before a specified substring: substr@l("abCDcdef","cd") //Return abCD
Split a string into an array of substrings by vertical bar: "aa|bb|cc".split("|") //Return ["aa","bb","cc"]
SPL also offers functions to get a date before or after a number of years, get the ordinal number of a date in the year, get which quarter the date belongs to, split a string according to a regular expression, get the where or select part of a SQL statement, get words from a string, split HTML by the specific marker, etc.
SPL does not store script and application together, supports hot-swap and computes data coming from different sources in a uniform way. All these features help to achieve the optimal framework.
SPL offers JDBC driver, stores script separately and supports hot swap
We can save the above SPL code as a script file and invoke the file name in a Java program in the way of calling a stored procedure:
Class.forName("com.esproc.jdbc.InternalDriver");
Connection connection =DriverManager.getConnection("jdbc:esproc:local://");
Statement statement = connection.createStatement();
ResultSet result = statement.executeQuery("call groupBy()");
Placing the script file outside a Java program separates computing logic from application and thus effectively reduces system coupling.
SPL, as an interpreted language, can achieve hot-swappable code by executing in real-time after any changes without restarting the Java application. The ability ensures system stability and makes maintenance convenient.
SPL supports diverse data sources and computes hierarchical data using consistent syntax
Besides regular files, SPL supports hierarchical files originated from WebService and RESTful, too. To read multilevel XML data from WebService and perform a conditional query, for instance:
A | |
1 | =ws_client("http://127.0.0.1:6868/ws/RQWebService.asmx?wsdl") |
2 | =ws_call(A1,"RQWebService":"RQWebServiceSoap":"getEmp_orders") |
3 | =A2.conj(Orders) |
4 | =A3.select(Amount>1000 && Amount<=2000 && like@c(Client,"\*business\*")) |
Similarly, to read multilevel JSON data from RESTful and perform the same conditional query:
A | |
1 | =httpfile("http://127.0.0.1:6868/restful/emp_orders").read() |
2 | =json(A1) |
3 | =A2.conj(Orders) |
4 | =A3.select(Amount>1000 && Amount<=2000 && like@c(Client,"\*business\*")) |
In addition to WebService and RESTful, many other data sources return hierarchical data, such as MongoDB, ElasticSearch and Salesforce. SPL supports them all and can directly retrieve data from them for further computation.
To retrieve multilevel JSON data from MongoDB and perform a conditional query, for instance:
A | |
1 | =mongo_open("mongodb://127.0.0.1:27017/mongo") |
2 | =mongo_shell@x(A1,"data.find()") |
3 | =A2.conj(Orders) |
4 | =A3.select(Amount>1000 && Amount<=2000 && like@c(Client,"\*business\*")) |
SPL also supports databases, file source like txt, csv and xls, and NoSQL databases like Hadoop, Redis, Kafka and Cassandra.
No matter what source data originates from, SPL consistently reads it as a table sequence, enabling the uniform way of computing hierarchical data. The consistent and uniform syntax makes SPL code easy to migrate.
SPL has convenient built-in function syntax, which is suitable to compute complicated hierarchical data and helps to simplify complex business logics. The language, actually, outperforms SQL in data processing.
Convenient built-in function syntax powers SPL with excellent computational capability
SPL offers unique function option syntax that allows functions with similar functionalities to use same name and distinguishes them with different options. The basic functionality of select function is to filter data. If we need to get the first eligible record, we use @1 option:
Orders.select@1(Amount>1000)
When the size of data to be filtered is large, we use @m option to implement parallel processing in order to increase performance:
Orders.select@m(Amount>1000)
We use @b option to perform a quick filtering using the binary search if the original data is ordered:
Orders.select@b(Amount>1000)
Function options can work together:
Orders.select@1b(Amount>1000)
Usually, parameters in a structured computation function are complicated. SQL, for instance, uses a lot of keywords to divide parameters of a statement into multiple groups, causing inconsistent statement structure.
SPL has hierarchical parameters. It employs semicolon, comma and colon to identify parameters to three levels, and writes complicated parameters in a general and simple way. For example:
join(Orders:o,SellerId ; Employees:e,EId)
The expressive SPL is fit for computing complicated hierarchical data
Here’s one instance. RESTful returns complicated multilevel JSON data that contains multiple subdocuments. The data is like this:
[
{
"race": {
"raceId":"1.33.1141109.2",
"meetingId":"1.33.1141109"
},
...
"numberOfRunners": 2,
"runners": [
{ "horseId":"1.00387464",
"trainer": {
"trainerId":"1.00034060"
},
"ownerColours":"Maroon,pink,dark blue."
},
{ "horseId":"1.00373620",
"trainer": {
"trainerId":"1.00010997"
},
"ownerColours":"Black,Maroon,green,pink."
}
]
},
...
]
We are trying to perform grouping & aggregation on different levels of data (group data by trainerID and count members of ownerColours in each group). It is hard to handle the computation using the familiar way, but it becomes easy to handle in SPL:
A | |
1 | … |
2 | =A1(1).runners |
3 | =A2.groups(trainer.trainerId; ownerColours.array().count():times) |
The powerful SPL simplifies complex business logics
SPL can handle computing scenarios with complex logics like stepwise computations, order-based computations and computations after data is grouped. It is easy and simple to handle many scenarios that SQL/stored procedures find hard to handle. To find the first n big customers whose orders amount takes up at least half of the total and sort them by amount in descending order:
A | B | |
1 | … | /Retrieve data |
2 | =A1.sort(amount:-1) | /Sort orders amount in descending order |
3 | =A2.cumulate(amount) | /Get the sequence of cumulative amounts |
4 | =A3.m(-1)/2 | /Calculate the final cumulative amount, i.e. the total |
5 | =A3.pselect(~>=A4) | /Find the position where the cumulative amount exceeds half of the total |
6 | =A2(to(A5)) | /Get target values by positions |
From hardcoding to JSONPth/XPath, JSON and XML processing technologies achieved groundbreaking development. And from JSONPth/XPath to SPL, we have more powerful ability to compute hierarchical data. With specialized built-in data objects and a wealth of computing functions, string functions and date functions, SPL possesses more than enough computational capability. The language supports storing script outside the application, uses the uniform way to compute data coming from various sources, and thus achieves the optimal application structure. It encapsulates convenient to use function syntax that is fit for processing complicated hierarchical data and streamlines the conventional complex business logics.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code