-
Notifications
You must be signed in to change notification settings - Fork 339
Just How Slow is Data Retrieval via JDBC
A JAVA program retrieves data from the database through JDBC. The retrieval speed is slow even sometimes the database load is not heavy at all and the SQL statement is simple.
Now let’s do a speed testing (take Oracle as the example) to better understand this.
Data is generated using the TPC-H data generator. We use the customer table to perform the test. The table has 1500,0000 rows and 8 fields. The original text file name is customer.tbl and has a size of 2.3G. We are trying to load data from the text file to a data table in Oracle using the data loading tool provided by the database.
Two Intel3014 processors of 1.7 G. Each CPU has 6 cores.
SSD: 1T, 561MB/s read and 523MB/s write
Interface: SATA 6.0Gb/s
RAM: 64G
OS: Linux CentOS 7
The whole test is performed locally on the server without any network transmission.
We are trying to carry out the data retrieval using the SQL statement through Oracle JDBC.
It is complicated to write the retrieval in Java. Instead, we do the test using a SPL (Strucrtured Process Language) script:
A | B | |
---|---|---|
1 | =now() | / Record the current time |
2 | =connect("oracle") | / Connect to the oracle database |
3 | =A2.cursor("SELECT * FROM CUSTOMER") | / Create a cursor from which data will be fetched |
4 | for A3,10000 | / Retrieve data circularly, 10000 records each time |
5 | =A2.close() | / Close database connection |
6 | =interval@s(A1,now()) | / Calculate the time spent in data retrieval |
Test result: 275 seconds
To let you have more perceptual understanding about this, we will take the text file to do the test as we did in above. We read data from the text file and parse records without performing any specific computations.
Write the following test SPL script:
A | B | |
---|---|---|
1 | =now() | / Record the current time |
2 | =file("customer.tbl") | / Create a file object |
3 | =A2.cursor(;," | ") |
4 | for A3,10000 | / Retrieve data circularly, 10000 records each time |
5 | =interval@s(A1,now()) | / Calculate the time spent in data retrieval |
Test result: 43 seconds
Reading a text file is 275/43=6.4 times faster than reading data from the Oracle database.
Text parsing is complicated though, reading data from a text file is far faster than retrieving data from the database.
When data can only be retrieved from the database and the database is not heavy-loaded, the multi-CPU-based parallel processing can be used to accelerate data retrieval. It is a great hassle to write the algorithm in Java because a series of issues, including resource sharing and collision, must be considered.
SPL’s parallel processing technique can enhance the database JDBC’s retrieval performance and avoid the complex JAVA hardcoding and facilitate concatenation of the result sets returned by multiple threads.
Below are our tests.
Here we still use the above customer table. Data is retrieved faster using the parallel processing. Below is the SPL code:
A | B | |
---|---|---|
1 | =now() | / Record the current time |
2 | =connect("oracle").query@ix("SELECT COUNT(*) FROM CUSTOMER")(1) | |
3 | >n=12 | / Define the number of parallel threads |
4 | =n.([int(A2/n)*(~-1),int(A2/n)*~]) | / Divide the table into segments according to the number of parallel threads |
5 | fork A4 | =connect("oracle") |
6 | =B5.query@x("SELECT * FROM CUSTOMER WHERE C_CUSTKEY>? AND C_CUSTKEY<=?",A5(1),A5(2)) | |
7 | =A5.conj() | / Concatenate result sets returned by threads |
8 | =interval@s(A1,now()) | / Calculate the time spent in data retrieval |
Test result: 28 seconds (275 seconds without parallel processing)
To perform the parallel retrieval, we need to divide the source data into relatively even intervals. In this example, C_CUSTKEY field contains natural numbers beginning from 1. We first count the total records (A2) and then split them into n segments evenly (A4). A5 performs the parallel retrieval, where each thread connects to the database and execute SQL using the intervals of C_CUSTKEY values as the parameter. Finally, result sets of these multiple threads are concatenated to get the final result.
In real-world situations, it is probably that we need a certain method to set the WHERE condition to get nearly even intervals.
We can also use parallel processing to make multi-SQL retrieval faster.
In this case, data is TPC-H generated and the total volume decreases to 5G. We use 5 tables in this test. Below is the SPL script without using parallel processing. The result is unsatisfactory.
A | B | |
---|---|---|
1 | SELECT * FROM SUPPLIER | |
2 | SELECT * FROM PART | |
3 | SELECT * FROM CUSTOMER | |
4 | SELECT * FROM PARTSUPP | |
5 | SELECT * FROM ORDERS | |
6 | =now() | / Record the current time |
7 | =connect("oracle") | /Connect to the database |
8 | =[A1:A5].(A7.query(~)) | / Execute SQL statements one by one in turn |
9 | >A7.close() | /Close database connection |
10 | =interval@ms(A6,now()) | /Calculate the time spent in data retrieval |
Test result: 360 seconds
While performance is considerably improved after we employ parallel retrieval. Below is the code:
A | B | |
---|---|---|
11 | =now() | / Record the current time |
12 | fork [A1:A5] | =connect("oracle") |
13 | =B12.query@x(A12) | |
14 | =interval@ms(A11,now()) | /Calculate the time spent in data retrieval |
Test result: 167 seconds
Though data probably cannot evenly be divided in the multi-table scenarios, parallel processing is still an effective way to increase retrieval performance significantly.
esProc SPL is an opensource software developed in Java, it provides JDBC interface and is very easy to be integrated in Java Applications.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code