-
Notifications
You must be signed in to change notification settings - Fork 340
Computing engine on open format files
The files of open formats like txt\csv\json\xml\xls are commonly used in our daily work, and need to be calculated and processed sometimes. There seem to be many tools that can calculate and process such files, but in reality, these tools all have their own shortcomings. Specifically, the libraries like OpenCSV\JsonPath are good at parsing but weak in computing; spark is relatively strong in computing, but too heavy in framework and steep in learning curve; embedded databases such as SQLite\HSQLDB have light framework, yet the process of loading data into database is long and complicated, and delay would be fatal; Other tools are either immature, or have long code, or are complicated to configure, the details will not be given here.
To calculate the files of open formats, there is a better choice: esProc SPL.
esProc SPL, as an JVM-based and open-source programming language, provides the parsing functions for many types of files of open formats. In addition, esProc SPL boasts powerful computing, writing, and application integration abilities, and is able to calculate the files of different formats using consistent data structure and coding.
The text files of regular formats are similar to a data table (two-dimensional structure), with the first row being the column names, each subsequent row being one record, and the columns being separated by a fixed symbol. In these texts, the comma-separated csv file and the tab-separated txt file are two most commonly seen formats. SPL provides T function to parse text files using just one line of code.
A1=T("D:\\data\\Orders.csv")
After the text in external storage is read into memory, SPL stores the data as a table sequence (two-dimensional structured data object). As a basic data type, the table sequence is provided with rich access syntax and processing functions:
Get the 3rd record after parsing: A1(3)
Get the last 3 records: A1.m([-1,-2,-3])
Get the field value of a record: A1(3).Amount*0.05
Modify the field value of a record: A1(3).Amount = A1(3). Amount*1.05
Get one column and return it as a set: A1.(Amount)
Get a few columns and return them as a set of sets: A1.([CLIENT,AMOUNT])
Get a value first by field and then by record number: A1.(AMOUNT)(2)
Get a value first by record number and then by field: A1(2).AMOUNT
Append a record: A1.insert(200,"APPL",10,2400.4,date("2010-10-10"))
After the table sequence is processed and calculated, we can use the export function to save it as regular text file with ease. For example, after the table sequence A1 is appended with record, save it as csv file with column name:
file("D:\\data\\result.csv").export@tc(A1)
We can specify the separator when writing the file:
file("D:\\data\\result.txt").export@t(A1; ":")
For the parsed table sequence, SPL offers rich computing functions, making it possible to easily accomplish everyday SQL-style calculations.
Filter: s.select(Amount>1000 && Amount<=3000 && like(Client,"*s*"))
Sort: s.sort(Client,-Amount)
Distinct: s.id(Client)
Group and aggregate: s.groups(year(OrderDate);sum(Amount))
Associate: join(T ("D:/data/Orders.csv"):O,SellerId; T("D:/data/Employees.txt"):E,EId)
TopN: s.top(-3;Amount)
In-group TopN: s.groups(Client;top(3,Amount))
In addition, SPL provides the syntax that conforms to SQL92 standard, and supports the set calculation, case when, with, nested subqueries, etc.
SPL offers powerful import function, which can parse the text files of irregular formats, including the file that contains special separators or special date formats, the file whose first row doesn’t contain column names, and the file with quotes stripped, or blanks removed, or data type specified, etc. For example, parse a text file separated by double-dash:
s=file("D:/Orders.txt").import@t(;,"--")
In addition to irregular-format text files, there are also irregular-content text files, which usually cannot be directly parsed into two-dimensional structured data. To cope with this problem, SPL provides flexible functional syntax to obtain desired data through simple handling. Let’s take a text file as an example, each three rows form one record, and the second row contains multiple fields, now we want to rearrange the file to transform it to a two-dimensional structured data sorted by the 3rd and 4th fields:
A | |
---|---|
1 | =file("D:\data.txt").import@si() |
2 | =A1.group((#-1)\3) |
3 | =A2.new(~(1):OrderID, (line=~(2).array("\t"))(1):Client,line(2):SellerId,line(3):Amount,~(3):OrderDate ) |
4 | =A3.sort(_3,_4) |
SPL has excellent computing ability and can easily handle the ordered operation, set-oriented operation, association calculation, and stepwise calculation that are difficult to handle in SQL and stored procedures. For example, the following SPL code is to calculate the maximum consecutive days that a stock keeps rising:
A | |
---|---|
1 | //File parsing |
2 | =a=0,A1.max(a=if(price>price[-1],a+1,0)) |
Another example, find out the top n customers whose cumulative sales account for at least half of the total sales, and sort them by sales in descending order:
A | B | |
---|---|---|
1 | //File parsing | |
2 | =A1.sort(amount:-1) | /Sort by sales in descending order |
3 | =A2.cumulate(amount) | /Get a sequence of cumulative amounts |
4 | =A3.m(-1)/2 | /Calculate the final accumulative amount, i.e., the total |
5 | =A3.pselect(~>=A4) | /Find the position where the amount exceeds half of the total |
6 | =A2(to(A5)) | /Get target values by position |
SPL provides a wealth of date and string functions, making it possible to effectively simplify related calculations.
Get the date before or after a specified number of quarters: elapse@q("2020-02-27",-3) // Return 2019-05-27
Get the date after N workdays: workday(date("2022-01-01"),25) // Return 2022-02-04
String functions: check if a string consists entirely of numbers: isdigit("12345") // Return true
Get a string before a specified substring: substr@l("abCDcdef","cd") // Return abCD
Split a string into an array of substrings by vertical bar: "aa|bb|cc".split("|") // Return ["aa","bb","cc"]
SPL also offers many other functions, for example, get a date before or after specified number of years, get which quarter the date belongs to, split a string according to a regular expression, get words from a string, split HTML by the specific marker, etc.
It is worth mentioning that in order to further improve development efficiency, SPL invents unique function syntax. For example, we can use option to distinguish similar functions. If we only want to filter out the first record that meets condition, we can use the option @1:
T.select@1(Amount>1000)
If we want to search for the first record from back to front, we can use the option @z:
T.select@z1(Amount>1000)
For large files whose size exceeds memory capacity, SPL provides convenient methods to parse, compute and write them, and related functions are elaborately encapsulated, and named and used similarly to functions for small files, and the learning curve is smoother. For example, read a large text, filter by parameter, and write to a new file:
A | |
---|---|
1 | =file("D:\sales.txt").cursor@t() |
2 | =A1.select(OrderDate>=P_startDate && OrderDate<=P_endDate) |
3 | =file("D:\sales.txt").export@t(A1) |
The function cursor is used to read the large file, and its usage is similar to function import, except that it generates a cursor type, and encapsulates the exchange details between the memory and the external storage, that is, read a part of the data from external storage to memory, calculate in memory, and accumulate the calculation results, and continue to read data from external storage. The calculation object of the computing function select and the writing function export is a cursor, which is essentially different from the table sequence of small file, yet the encapsulated function name is the same (and the parameter is the same) in order to reduce learning cost.
Like the table sequence, the cursor also supports calculations involving large data volume such as sorting, grouping and aggregating, association, and set-oriented calculation.
SPL supports multi-thread parallel computing, which can improve the speed of processing large file. For example, change the above example to parallel computing, we only need to modify A1 code to:
=file("D:\\sales.txt").cursor@m()
The option @m means multi-thread parallel computing. By default, the number of threads is the one in the configuration file, or can be specified in real time. This code generates parallel multiple cursors, and read file at the same time, and subsequent calculations will also be executed in parallel.
For ordered large files, SPL can further improve computing performance.
SPL provides JDBC driver, through which it can be easily integrated into Java code: Simple SPL code can be embedded directly in Java like SQL:
Class.forName("com.esproc.jdbc.InternalDriver");
Connection connection =DriverManager.getConnection("jdbc:esproc:local://");
Statement statement = connection.createStatement();
String str="=T(\"D:/Orders.xls\").select(Amount>1000 && Amount<=3000 && like(Client,\"*s*\"))";
ResultSet result = statement.executeQuery(str);
For complex SPL code, we can save it as a script file first, and then invoke it by a Java program in the way of invoking a stored procedure, which can effectively reduce the coupling between computing code and front-end application.
Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
CallableStatement statement = conn.prepareCall("{call scriptFileName(?, ?)}");
statement.setObject(1, "2020-01-01");
statement.setObject(2, "2020-01-31");
statement.execute();
SPL is an interpreted language, and the outside-application SPL code can be executed without compiling. In addition, SPL supports non-stop hot deployment, this makes it suitable for the changing business logic, and makes O&M complexity low.
SPL provides a wealth of parsing functions, and supports many types of open-format files. Such files can all be parsed as the same data object “table sequence”, which can be calculated with consistent functions and syntax, and the code remains unchanged.
POI is a stable and mature xls file parsing library. Due to its too low-level functionalities, it is cumbersome to code. SPL encapsulates POI’s functionalities, allowing us to read/write various xls files through greatly simplified functions. To read the row-wise xls files of regular formats as table sequence, we can still use SPL T function:
=T("d:\\Orders.xls")
To generate a row-wise xls file of regular format from a table sequence, we can use the xlsexport function. For example, write A1 to the first sheet of a new xls file where the first row contains column names:
=file("e:/result.xlsx").xlsexport@t(A1)
The xlsexport function is versatile. It can write a table sequence to a specified sheet, or write part of rows or columns of a table sequence to the sheet. It is convenient to append data using xlsexport function. For example, there is an xls file having data in it, and we want to append the table sequence A1 to the end of the file. The appended data will use the same appearance as the last row of the file:
=file("e:/scores.xlsx").xlsexport@a(A1)
For the row-wise xls file of irregular formats, SPL provides xlsimport function with rich and concise built-in read functionalities.
Import an Excel file by skipping the title in the first two rows: file("D:/Orders.xlsx").xlsimport@t(;,3)
Import a sheet named "sales": file("D:/Orders.xlsx").xlsimport@t(;"sales")
The xlsimport function also have other functionalities like reading N rows backwards, opening an xls file using password and reading a large xls file.
For xls files of extremely irregular formats, SPL provides xlscell function, which can read/write the data of a specified range of a given sheet.
Unlike the two-dimensional data of regular formats such as text and xls files, Json and XML are hierarchical data that cannot be handled by general computing engines. SPL table sequence is carefully designed, and supports both the two-dimensional data and hierarchical data (the former is a special case of the latter).
For example, the following code is to parse a Json file and perform conditional query:
A | B | |
---|---|---|
1 | =file("d:\xml\emp_orders.json").read() | Read Json string |
2 | =json(A1) | Parse it as SPL table sequence |
3 | =A2.conj(Orders) | Merge lower-layer records |
4 | =A3.select(Amount>1000 && Amount<=2000 && like@c(Client,"*business*")) | Conditional query |
Click cell A2, we can see a hierarchical table sequence structure, where Eid field and State field store the data of simple types, and Orders field stores the set of records (two-dimensional tables). Click a row under Orders, we can expand the value to view details:
No matter what file format it is, we can use same code to compute as long as the file is parsed as a table sequence. For example, read XML string from a file (same structure as the previous Jason file), and perform conditional query. We only need to modify the first two lines of code:
A | B | |
---|---|---|
1 | =file("d:\xml\emp_orders.xml").read() | Read XML string |
2 | =xml(A1,"xml/row") | Parse it as SPL hierarchical table sequence |
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code