-
Notifications
You must be signed in to change notification settings - Fork 339
Handling Excel Files in Java:From POI to SPL
Parsing, generating, querying and computing Excel files are common tasks Java often handles. As Excel has complicated file format, it is hard to read /write Excel files by writing your own code. But this is made convenient by various class libraries like POI, EasyExcel and JExcel, where POI is the best.
POI can read/write Excel files of various formats, from the older binary format (xls) to the current OOXML (xlsx) format. It supports both large memory jobs by reading/writing the whole Excel file at once and small memory jobs by reading/writing the file in a stream style.POI designs Java classes for each of the many Excel elements, including workbook, printer, sheet, row and cell. Cell-related classes involve cell type, font style, color, date, alignment and border style, etc. There are more than forty methods in the cell style class for performing the most comprehensive read/write operations.
Yet getting down to the tiniest detail creates low-level functionalities and, by taking care of every detail themselves, developers write programs totally from scratch. Even a very simple operation needs a lot of code. To read a row-wise xls file where the first row contains column headers, for instance:
FileInputStream fileInputStream = new FileInputStream("d:\\Orders.xls");
// get the excel book
Workbook workbook = new HSSFWorkbook(fileInputStream);
if (workbook != null) {
// get the first sheet
Sheet sheet = workbook.getSheetAt(0);
if (sheet != null) {
//get the col name/first line
Row rowTitle = sheet.getRow(0); // first line
if (rowTitle != null) {
int cellTitles = rowTitle.getPhysicalNumberOfCells(); // get column number
for (int i = 0; i < cellTitles; i++) {
Cell cell = rowTitle.getCell(i); //the cell!
if (cell != null) {
System.out.print(cell.getStringCellValue() + " | ");
}
}
}
//get the value/other lines
int rows = sheet.getPhysicalNumberOfRows(); // get line number
for (int i = 1; i < rows; i++) {
Row row = sheet.getRow(i); // get row i
if (row != null) {
int cells = row.getPhysicalNumberOfCells(); // get column number
for (int j = 0; j < cells; j++) {
// line number and row number
System.out.print("[" + i + "-" + j + "]");
Cell cell = row.getCell(j); // the cell!
if (cell != null) {
int cellType = cell.getCellType();
Object value = "";
switch (cellType) {
case HSSFCell.CELL_TYPE_STRING: // string
value = cell.getStringCellValue();
break;
case HSSFCell.CELL_TYPE_BLANK: // 空
break;
case HSSFCell.CELL_TYPE_BOOLEAN: // boolean
value = cell.getBooleanCellValue();
break;
case HSSFCell.CELL_TYPE_NUMERIC: // number
if (HSSFDateUtil.isCellDateFormatted(cell)) { // date number
Date date = cell.getDateCellValue();
value = new DateTime(date).toString("yyyy-MM-dd HH:mm:ss");
}else { // normal number
// change to string to avoid being too long
cell.setCellType(HSSFCell.CELL_TYPE_STRING);
value = cell;
}
break;
case HSSFCell.CELL_TYPE_ERROR:
throw new RuntimeException("data type mistaken");
}
System.out.println(value);
}
}
}
System.out.println("end of the "+i+" line");
}
System.out.println("end of the value lines=======================================");
}
}
Row-wise xls files are the most commonly seen Excel format, but POI does not offer any convenient methods to handle them. The only way is to parse a file by loop in the order of workbook->sheet->line->cell, which generates extremely complicated code as the above shows.
The code is already so complicated for simply reading data from an xls file. To further process the Excel data, we need to first transform it to a structured data object like ArrayList<entity class> or HashMap and the code will be even more complex.
Parsing files is not the target. The real need is to query and compute them. POI, as the Excel parsing class, does not and is not suitable for providing relevant methods. As a result, programmers have to resort to manual hardcoding in Java. To achieve a basic grouping & aggregation operation, for instance, Java has code like this:
Comparator<salesRecord> comparator = new Comparator<salesRecord>() {
public int compare(salesRecord s1, salesRecord s2) {
if (!s1.salesman.equals(s2.salesman)) {
return s1.salesman.compareTo(s2.salesman);
} else {
return s1.ID.compareTo(s2.ID);
}
}
};
Collections.sort(sales, comparator);
ArrayList<resultRecord> result=new ArrayList<resultRecord>();
salesRecord standard=sales.get(0);
float sumValue=standard.value;
for(int i = 1;i < sales.size(); i ++){
salesRecord rd=sales.get(i);
if(rd.salesman.equals(standard.salesman)){
sumValue=sumValue+rd.value;
}else{
result.add(new resultRecord(standard.salesman,sumValue));
standard=rd;
sumValue=standard.value;
}
}
result.add(new resultRecord(standard.salesman,sumValue));
Coding the computation in Java is not only complicated but has framework fault. The code is not reusable. There is a tight coupling between data structure and computing code. If data structure is changed, the code needs to be rewritten. Querying needs are volatile, and Java, as a compiled language, requires restarting the application for each modification of the code. This involves heavy workload and leads to system instability.
Mature and stable as POI is, it has too low-level read/write functionalities and does not offer querying and computing ability. It is inefficient to process Excel files (particularly to query and compute them) directly based on POI. If we can encapsulate POI to get a simple and easy to use high-level read/write function and provide additional querying and computing ability, the efficiency will be considerably increased.
esProc SPL has done excellently well in achieving that.
SPL is an open-source computing engine under JVM. It encapsulates POI to offer built-in simple and easy to use high-level functions that parses/generates various regular- or irregular-format Excel files and automatically generates a structured data object.
SPL offers T function for parsing row-wise Excel files of regular formats . To parse the previously mentioned xls file, for instance, POI needs dozens of lines of code but SPL with encapsulated POI only needs a single line:
=T("d:\Orders.xls")
Parsing row-wise Excel files is a common task. SPL T function encapsulates POI’s functionalities in it and has simple and easy to use interface for parsing both xls and xlsx files. It provides automatic type conversion functionality, saving troubles in taking care of details. The function automatically distinguishes column headers in the first row and detailed data in other rows, creates a table sequence (SPL structured data object) according to column headers and populates data to it:
After SPL reads and parses data as a table sequence, it has a wealth of structured data computing methods to process it:
Get the 3rdrecord: A1(3)
Get the last three records backwards: A1.m([-1,-2,-3])
Calculate a field value in a specified record: A1(3).Amount*0.05
Modify a field value in a specified record: A1(3).Amount = A1(3). Amount*1.05
Get a column and return it as a set: A1.(Amount)
Get multiple columns and return them as a set of sets: A1.([CLIENT,AMOUNT])
Append a record: A1.insert(200,"APPL",10,2400.4,date("2010-10-10"))
Get a value first by field and then by record number: A1.(AMOUNT)(2),which is equivalent to getting it first by record number and then by field A1(2).AMOUNT
SPL offers xlsimport function to parse row-wise xls filesof relatively irregular formats with rich and simplistic built-in read/write functionalities:
Import an Excel file without column headers and detailed data starts from the first row: file("D:\Orders.xlsx").xlsimport()
Import an Excel file by skipping the title in the first two row: file("D:/Orders.xlsx").xlsimport@t(;,3)
Import Excel data from the 3rdrow to the 10throw: file("D:/Orders.xlsx").xlsimport@t(;,3:10)
Import 3 columns of Excel data: file("D:/Orders.xlsx").xlsimport@t(OrderID,Amount,OrderDate)
Import a sheet named "sales":file("D:/Orders.xlsx").xlsimport@t(;"sales")
The xlsimport function also have other functionalities like reading N rows backwards, opening an xls file using password and reading a large xls file.
SPL uses xlscell function to parse xls files with extremely irregular formats . It can read/write data in a specified range of a given sheet. To read cell A2 in sheet1, for instance:
=file("d:/Orders.xlsx").xlsopen().xlscell("C2")
SPL is capable of parsing free-format xls files with its agile syntax. One instance is to parse the following file as a standard two-dimensional table (table sequence).
The file has a very irregular format. Writing Java code with POI will be a heavy and time-consuming job, but SPL code is short and concise:
A | B | C | |
1 | =create(ID,Name,Sex,Position,Birthday,Phone,Address,PostCode) | ||
2 | =file("e:/excel/employe.xlsx").xlsopen() | ||
3 | [C,C,F,C,C,D,C,C] | [1,2,2,3,4,5,7,8] | |
4 | for | =A3.(~/B3(#)).(A2.xlscell(~)) | |
5 | if len(B4(1))==0 | break | |
6 | >A1.record(B4) | ||
7 | >B3=B3.(~+9) |
SPL has xlsexport function to generate regular-format row-wise xls files . It is simple to use. For example, result of parsing the previous file is a table sequence and stored in SPL’s cell A1, and we want to write A1 to the first sheet of a new xls file where the first row contains column headers. SPL only needs one line of code: =file("e:/result.xlsx").xlsexport@t(A1).
The xlsexport function is versatile. It can write a table sequence to a specified sheet, or its certain rows or columns to the sheet:
=file("e:/scores.xlsx").xlsexport@t(A1,No,Name,Class,Maths)
It is convenient to append data using xlsexport function. Suppose there is an xls file having data in it, and we are trying to append table sequence A1 to the end of the file. The appended data will use the same appearance as the last row of the file:
=file("e:/scores.xlsx").xlsexport@a(A1)
The above-mentioned xlscell function canbe used to write data to an irregular-format range . For example, the blue cells in the following xls file contain irregular table headers and we need to fill data in the corresponding blank cells:
The POI code will be bloated and lengthy. SPL code, as shown below, is short and concise:
A | B | C | D | E | F | |
1 | Mengniu Funds | 2017 | 3 | 58.2 | 364 | 300 |
2 | 8.5 | 50 | 200 | 100 | 400 | 200 |
3 | 182.6 | 76.3 | 43.7 | 28.5 | 16.4 | |
4 | 120 | 1.07 | 30 | 0.27 | 90 | 0.8 |
5 | 154 | 6 | 4 | |||
6 | =file("e:/result.xlsx") | =A6.xlsopen() | ||||
7 | =C6.xlscell("B2",1;A1) | =C6.xlscell("J2",1;B1) | =C6.xlscell("L2",1;C1) | |||
8 | =C6.xlscell("B3",1;D1) | =C6.xlscell("G3",1;E1) | =C6.xlscell("K3",1;F1) | |||
9 | =C6.xlscell("B6",1;[A2:F2].concat("\t")) | =C6.xlscell("H6",1;[A3:E3].concat("\t")) | ||||
10 | =C6.xlscell("B9",1;[A4:F4].concat("\t")) | =C6.xlscell("B11",1;[A5:C5].concat("\t")) | ||||
11 | =A6.xlswrite(B6) |
Note that row6, row9 and row11 have continuous cells where SPL condenses code to fill them together. POI, however, can only operate cell by cell.
Data querying and computing forms the major part of Excel processing. SPL supplies a great number of computing functions, string functions and date functions, as well as standard SQL syntax to deal with not only daily Excel processing but processing of irregular-format and complex-logic Excel files.
SPL boasts rich computing functions for performing basic computations directly. It uses only one line of code to accomplish the previous grouping & aggregation computation:
A1.groups(SellerId;sum(Amount))
It handles more computations conveniently:
Conditional query: A1.select(Amount>1000 && Amount<=3000 && like(Client,"S"))
Sort: A1.sort(Client,-Amount)"
Distinct: A1.id(Client)"
Join two xlsx files:
join(T("D:/Orders.xlsx"):O,SellerId; T("D:/Employees.xls"):E,EId).new(O.OrderID,O.Client,O.SellerId,O.Amount,O.OrderDate, E.Name,E.Gender,E.Dept)"
Get TopN:T("D:/Orders.xls").top(-3;Amount)
Get TopN in each group (analytic function): T("D:/Orders.xls").groups(Client;top(3,Amount))
SPL offers so many date functions and string functions that it can generate short code and achieve high efficiency: For example:
Time functions:
Get the date before or after a specified date: elapse("2020-02-27",5) //Return 2020-03-03
Find the day of the week of a specified date: day@w("2020-02-27") // Return 5, which is Thursday
Get the date N workdays after: workday(date("2022-01-01"),25) //Return 2022-02-04
String functions:
Check whether a string all consists of letters: isdigit("12345") //Return true
Get a string before a specified substring: substr@l("abCDcdef","cd") /ReturnabCD
Split a string into an array of substrings by vertical bar: "aa|bb|cc".split("|") //Return ["aa","bb","cc"]
SPL also offers functions to get a date before or after a number of years, get the ordinal number of a date in the year, get which quarter the date belongs to, split a string according to a regular expression, get the where or select part of a SQL statement, get words from a string, split HTML by the specific marker, etc.
SPL supports standard SQL syntax to query xls files directly as it queries database tables. This significantly reduces costs of learning for programmers:
filter:$select * from d:/sOrder.xlsx where Client like '%S%' or (Amount>1000 and Amount<=2000)
sort:$select * from sales.xls order by Client,Amont desc
distinct:$ select distinct(sellerid) from sales.xls
group by…having:$select year(orderdate) y,sum(amount) s from sales.xls group by year(orderdate) having sum(amount)>=2000000
join:$select e.name, s.orderdate, s.amount from sales.xls s left join employee.xlsx e on s.sellerid= e.eid
SPL supports most of the syntax in SQL-92 standard, such as set-oriented operations, case when, with and nested query. Learn more about SPL simple SQL in Open-source SPL that can execute SQL without RDB.
Irregular-format xls files are hard to be handled with general class libraries . But SPL can handle them effortlessly and smoothly with its agile syntax and rich functions. Suppose Excel cells contain many strings of “key=value” format and we need to rearrange it into a standard two-dimensional table for further processing, SPL has the following code:
A | |
1 | =file("D:/data/keyvalue.xlsx").xlsimport@w() |
2 | =A1.conj().(~.split("=")) |
3 | =A2.new(~(1),~(2)) |
Complex-logic computations are another SQL (stored procedure) difficulty . SPL has excellent computational capability. It can handle computations with complex logics conveniently and easily. To count the largest number of consecutively rising days for a stock, for instance:
A | |
1 | =T("d:/AAPL.xlsx") |
2 | =a=0,A1.max(a=if(price>price[-1],a+1,0)) |
SPL is an interpreted language. It offers JDBC driver to be easily integrated into a Java program in the way of invoking the SQL or a stored procedure. This reduces framework coupling and enables support of hot swap. SPL also supports diverse data sources, as well as cross-data-source computations.
SPL supplies JDBC driver , through which it can be easily invoked by a Java program. As SQL, simple SPL code can be directly embedded into Java code. To perform a conditional query, for instance:
Class.forName("com.esproc.jdbc.InternalDriver");
Connection connection =DriverManager.getConnection("jdbc:esproc:local://");
Statement statement = connection.createStatement();
String str="=T(\"D:/Orders.xls\").select(Amount>1000 && Amount<=3000 && like(Client,\"*S*\"))";
ResultSet result = statement.executeQuery(str);
SPL’s strategy of placing the computing logic outside a Java program reduces coupling between computing code and the front-end application . The more complex SPL code can be first saved as a script file and then invoked by a Java program in the way of invoking a stored procedure:
Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
CallableStatement statement = conn.prepareCall("{call scriptFileName(?, ?)}");
statement.setObject(1, "2020-01-01");
statement.setObject(2, "2020-01-31");
statement.execute();
SPL, as an interpreted language, can achieve hot-swappable code by storing it separately from the Java program . The SPL code can execute in real-time after any changes without restarting the Java application, which reduces maintenance workload and ensures system stability.
SPL supports various file sources , including xls, csv, txt, XML and JSON, etc. To perform a conditional query on a txt file, for instance:
T("sOrders.txt").groups(SellerId;sum(Amount))
$select * from d:/sOrders.txt where Client like '%S%' or (Amount>1000 and Amount<=2000)
SPL support cross-data-source computations . To perform a join between an xls file and a txt file, for instane:
=join(T("D:/Orders.xlsx"):O,SellerId; T("D:/Employees.txt"):E,EId).new(O.OrderID,O.Client,O.SellerId,O.Amount,O.OrderDate, E.Name,E.Gender,E.Dept)"
SPL can access all types of RDBs, network services like WebService and RESTful, and NoSQL databases like Hadoop, Redis, Kafka and Cassandra.
In summary, POI is only fit for handling simple Excel parsing and generation tasks and it does not provide any querying and computing ability. SPL encapsulates POI’s functionalities to transform them into its built-in high-level read/write functions. This greatly simplifies traditionally complicated code for data read/write, and enables parsing/generation of Excel files with irregular and free-style formats. Additionally, SPL offers outstanding computational capability, capable of dealing with both daily and irregular-format and complex-logic Excel querying. SPL supports the optimal framework that enables loose coupling and support of hot swap, diverse data sources and cross-data-source computations.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code