-
Notifications
You must be signed in to change notification settings - Fork 340
Looking for the Best Programming Language for Processing Excel Files
As a data analyst, you have a lot of opportunities to process Excel files. At times you need to do the processing using a program. There are several scripting languages to choose. Here we compare them in a number of aspects, such as if they are easy to use when parsing an Excel file, how adaptable they are with different data format, to know the pros and cons of each. The best ones are Python pandas and esProc SPL. Yet the latter is even better. Looking ${article} for details.
Excel files are common for data analysis. A program is needed sometimes to do the analysis. This requires that a programming language be able to parse Excel files conveniently.
Programming languages that can parse and process Excel files:
- General-purpose high-level programming languages, such as Java;
- Excel VBA;
- Python;
- esProc SPL.
Now let’s look at how these languages use their own ways to do the parsing and processing. Here my focus is how they read in an Excel file as the structured data. Only brief explanation about handling the subsequent processing, writing data to the database or data manipulation according to specific requirements,
Almost all high-level languages can read Excel files. The point is that if there isn’t a third-party professional data retrieval API, you need to write your own program to read data according to the target Excel file’s structure. That involves a lot of work and effort. Fortunately Java has Apache POI to read and write Excel files. The API can read values and properties of each cell. Now let’s look at how it read Excel spreadsheets into structured data.
Here’s a simple Excel file. The first row is column headers. All the rest of the rows contain detailed data rows. The file is as follows:
To use Java to call POI to read the data, we have the following program:
DataSet ds = null; //A user-defined class that stores data read from the Excel file;
HSSFWorkbook wb = new HSSFWorkbook( new FileInputStream( "simple.xls" ) );
HSSFSheet sheet = wb.getSheetAt( 0 ); //Assume the to-be-read data is stored in the first sheet
int rows = sheet.getLastRowNum();
int cols = sheet.getRow(0).getLastCellNum();
ds = new DataSet( rows, cols );
for( int row = 0; row <= rows; row++ ) {
HSSFRow r = sheet.getRow( row );
for( int col = 0; col <= cols; col++ ) {
HSSFCell cell = r.getCell( col );
int type = cell.getCellType();
Object cellValue; //Cell value object
switch( type ) { //Switch cell value to corresponding Java object according to its type
case HSSFCell.CELL_TYPE_STRING:
......
case HSSFCell.CELL_TYPE_NUMERIC:
......
......
// The cell value handling code is omitted
}
if( row == 0 ) ds.setColTitle( col, (String)cellValue );
else ds.setCellValue( row, col, cellValue );
//Set cell values of as column headers if they are in the 1st row, otherwise set them as a data set
}
}
The program can only read in an Excel file having the simplest format. It is quite long even the cell value handling part is omitted. It will be longer and more complicated if the Excel file has complex format, such as merged cells, complicated multi-row table headers & footers, multi-row records and crosstab.
Even with such a powerful open-source package as POI, it’s still rather complicated to parse an Excel file in Java.
Moreover, high-level languages provide only low-level functions and lack special functions for structured data computations like the filtering, sorting, grouping & aggregation and join over data sets. Programmers need to write theirs specifically. So there is still a lot of work to do even after data is read and parsed.
VBA (Visual Basic for Applications) is the Visual Basic-based programming language that you can use to create a macro and extend Windows Office applications, including Word, Excel, Access, etc. It is embedded in Excel to increase the latter’s flexibility and data processing capability. It can directly get data in Excel cells. The natural parsing ability makes it more convenient than high-level languages like Java. But, it also lacks structured computation functions. That means you still need to write lots of code for the computations after data is retrieved.
Below is a snippet of code for doing grouping & aggregation (group column A and aggregate column B in sheet1):
Public Sub test()
Dim Arr
Dim MyRng As Range
Dim i As Long
Dim Dic As Object
Set MyRng = Range("A1").CurrentRegion
Set MyRng = MyRng.Offset(1).Resize(MyRng.Rows.Count - 1, 2)
Set Dic = CreateObject("Scripting.dictionary")
Arr = MyRng
For i = 1 To UBound(Arr)
If Not Dic.exists(Arr(i, 1)) Then
Dic.Add Arr(i, 1), Arr(i, 2)
Else
Dic.Item(Arr(i, 1)) = Dic.Item(Arr(i, 1)) + Arr(i, 2)
End If
Next i
Sheet2.Range("A1") = "subject"
Sheet2.Range("A2").Resize(Dic.Count) = Application.WorksheetFunction.Transpose(Dic.keys)
Sheet2.Range("B1") = "subtotal"
Sheet2.Range("B2").Resize(Dic.Count) = Application.WorksheetFunction.Transpose(Dic.items)
Set Dic = Nothing
End Sub
Parsing an Excel file is not our aim. It just serves the subsequent computations and manipulations. Though VBA has the intrinsic ability of parsing Excel data, it is not as good as processing it.
Python pandas offers the interface of reading Excel files. We can read the previous simple format Excel file in Python with the following code:
import pandas as pd
file = 'simple.xls'
data = pd.read_excel(file,sheet_name='Sheet1',header=0)
Setting parameter header as 0 (header=0) means reading the first row as the column headers. “data” is the retrieved structured data set.
To read an Excel spreadsheet with complicated table headers (as shown below):
The Python program is as follows:
import pandas as pd
file = 'complex.xls'
data = pd.read_excel(file,sheet_name='Sheet1',header=None,skiprows=[0,1,2,3])
data.columns=['No', 'ItemCode', 'ItemName', 'Unit', 'Quantity', 'Price', 'Sum']
The program specifies that the table headers not be read and that the reading skips the first 4 rows to begin from the 5th row through parameters (you can specify skipping the table footers if they exist). It set the alias for data set “data” in the last line of code.
To read a crosstab (as shown below):
You can use the following Python program:
import pandas as pd
file = 'cross.xls'
data = pd.read_excel(file,sheet_name='Sheet1',header=1)
data = data.melt(id_vars=['Unnamed: 0'],
value_vars=['West', 'East','Center', 'North','South', 'Northwest','Southwest'],
var_name='Area',
value_name='Amount')
data.rename(columns={'Unnamed: 0': 'Type'})
The retrieved data is as follows:
It’s simpler to read an Excel file in Python than in Java. As Pandas encapsulates structured computation function, Python gives stronger support of the subsequent computations than Java and VBA. These make it easy for Python to process small Excel files that can wholly loaded into the memory.
Python’s problem is that it doesn’t have batch processing methods for big Excel files that cannot be loaded into the memory at once. So you have to write complicated program for both reading and subsequent computations. Read How Python Handles Big Files learn more.
As a professional open-source data processing tool, esProc SPL offers various methods of reading Excel files. SPL, the scripting language esProc is based, encapsulates a rich library of structured computation functions that can handle all subsequent computations and the exporting of result set and writing it to the database excellently.
The SPL program for reading Excel data is surprisingly simple. It has only a single line of code:
- Read simple format Excel file
=file("simple.xls").xlsimport@t()
@t option means reading the 1st row as the column headers
- Read an Excel sheet with complicated table headers
=file("complex.xls"). xlsimport(;1,5).rename(#1:No,#2:ItemCode,#3:ItemName,
#4:Unit,#5:Quantity,#6:Price,#7:Sum)
Parameters 1,5 specify the reading of sheet1 beginning from the 5th row (you can also specify an ending row); rename function modifies the column names.
- Read a crosstab
=file("cross.xls").xlsimport@t(;1,2).rename(#1:Type).pivot@r(Type;Area,Amount)
pivot function groups records by Type and performs transposition. @r option enables a column-to-row transposition. The new column names after transposition are “Area” and “Amount”.
Apparently esProc SPL produces more concise code than Python pandas in parsing the Excel file. Actually SPL has a more competitive edge on handling the subsequent computations. Read Looking for the Best Lightweight Data Analysis Script Tools to find more.
On top of that, esProc SPL has the cursor mechanism to handle the reading and computation of big Excel files in a convenient way. This enables data analysts to process a relatively large amount of data using syntax similar to that of processing small amounts of data and in an intuitive, simple way. To read a big Excel file with simple format, for example, the program using the cursor is as follows:
=file("big.xlsx").xlsimport@tc()
The conclusion: It’s concise to parse an Excel file in both Python pandas and esProc SPL. Both languages are equipped with a wealth of structured computation functions to deal with daily analysis work well. But, esProc SPL is more concise and convenient, particularly in handling big files.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code