-
Notifications
You must be signed in to change notification settings - Fork 339
Using Java Python in place of stored procedure for batch jobs?
Using stored procedure for batch jobs has always been a common practice. Proceduralizing SQL through stored procedure can meet relatively complex batch job scenarios, and running in the database (without exporting data out of database) can obtain relatively good performance. But the stored procedure also has many shortcomings, which are embodied in the following aspects: i) difficult to edit and debug, and lack of an effective development environment; ii) poor migration, and difficult to adapt to the requirement of changing data source; iii) poor scalability (when the performance is insufficient, it is unable to scale out and can only scale up), and lower capacity ceiling; iv) relying heavily on stored procedure will exert excessive pressure on database. Due to the difficulty of scaling out, and the inability to change data source (database), slow batch job often occurs. As the business continues to grow, the pressure on database becomes more and more obvious.
In order to avoid these shortcomings (and because some AP-based databases cannot support stored procedures well, or even do not support at all), many scenarios now begin to use Java/Python for batch jobs. Using high-level language for batch jobs is more flexible, and some complex multi-step operations (such as ordered calculation) that are difficult to implement in SQL is simple to implement in Java/Python. In addition, high-level languages have stronger migration, and their computing logics on batch jobs are all outside the database, and hence it is very easy to adapt to the change of data sources. Moreover, parallel and distributed computing programs can be written through hard coding in theory, making it possible to adapt to the needs of batch job with any data volume.
However, using Java/Python for batch job has some significant shortcomings.
Both Java and Python provide some computing libraries, allowing us to accomplish some basic calculations conveniently with the help of Stream and Pandas. However, both of them only provide in-memory computing solution, and don’t provide the specific cursor computing mechanism, resulting in poor ability to compute big data.
When using Java to implement a computing task which exceeds the capacity of memory, it needs to adopt an indirect way. Let’s take the sorting operation as an example, Java first needs to divide large amount of data into multiple small files, then sort the data of each small file, and finally merge all the sorted small files to generate an ordered large file. This way avoids loading all data into memory to sort, and hence the sorting operation exceeding the capacity of memory is implemented. Although it sounds simple, it is not easy to implement.
This is also the case for grouping operation. Usually, the databases provide directly usable hash grouping algorithm, so when using database to perform hash grouping, the functionality that comes with database can be applied directly, and the performance will be ok in general. In contrast, although Java provides HashMap, it works only for in-memory computing. When the amount of data exceeds the limit of memory, the implementation of Hash grouping in external storage will be much more complicated, which is too difficult for many application programmers. As a result, they often have to resort to simple methods such as sorting in external storage to accomplish the grouping, yet, the performance is affected though the difficulty is reduced.
Similarly, Python works well for in-memory computing, and the performance of its basic functions (such as groupby and merge) is also good. Especially for the matrix operation of pure number, Python can achieve very good performance. But Python has the same problems as Java. When faced with a calculation which exceeds the memory capacity, programmers usually need to read the data in segments and code themselves to accomplish calculation. Since complex algorithms like Hash are difficult to implement, and unfamiliar to many programmers, so they have to resort to simple algorithms such as sorting in external storage as mentioned above to accomplish calculation. In short, the lack of external storage computing will lead to high implementation difficulty and poor performance.
As a result, when using Java/Python for batch jobs, we still need to resort to database to accomplish calculation in most cases, and let Java/Python direct the database to execute SQL, which is not much different from the way of original databased-based computing, and also puts most of the pressure on database side. This approach does not relieve much pressure of database, and almost all the adverse effects of original stored procedure still exist.
The amount of data involved in batch jobs is large, and parallel computing can effectively utilize the advantages of multi-core CPU to speed up calculation. Java itself provides the multi-thread mechanism, allowing us to divide a task into multiple subtasks, and execute the subtasks concurrently with multiple threads. Java 7 and higher versions also provide the Fork/Join framework, which can recursively split a large task into multiple small tasks and assign them to multiple threads to execute concurrently. Java 8 introduces Stream API, which provides the parallel stream. Parallel stream can split a large data set into multiple small data sets, then process the small data sets in parallel, and finally merge the results of every small data set to get final result. Parallel stream is equivalent to the extension and simplification of Fork/Join framework.
However, although Java is provided with parallel processing functionality, it is not easy to use. Specifically, for multi-thread programming, programmers need to consider various situations such as inter-thread task scheduling, thread communication and algorithm design, and also need to consider how to effectively segment the data so as to provide data for multiple threads in a balanced manner. One method is to split data into multiple files in advance, but this method is inefficient and inflexible; another method is to directly use the data sources like database, yet it cannot make use of the cursor to do data segmentation. As a result, the final performance is low.
Compared with Java, Python performs worse, and is equivalent to having no parallel processing ability. Parallel processing of Python itself is fake, and is actually serial processing for CPU, or even slower than serial processing, and thus it is difficult to leverage the advantages of modern multi-core CPU. There is a Global Interpreter Lock in the CPython interpreter (the mainstream interpreter of Python), and this lock needs to be got ahead of executing Python code, which means that even if multiple threads of CPU work in the same time period, it is only possible to execute the code with one thread and multiple threads can only execute the code in an alternately way. Yet, since multiple-thread execution involves complex transactions such as context switching and lock mechanism processing, the performance is not improved but decreased.
Since Python cannot make use of simple multi-thread parallel processing mechanism in one process, many programmers have to adopt the complicated multi-process parallel processing method. The cost and management of process itself are much more complex, and the effect of parallel processing cannot be comparable to that of multiple threads. In addition, the inter-process communication is also very complex, programmers have to give up direct communication sometimes, and use file system to transfer the aggregation result instead, which leads to a significant performance decrease.
The performance of computing big data is largely related to data IO. If the IO cost is too high, it doesn’t work no matter how fast the operation speed is. Efficient IO often relies on specially optimized storage scheme. Unfortunately, however, both Java and Python lack a widely used efficient storage scheme, and instead they use text file or database to store data in general. The performance of database interface is very poor. Although text is a little bit better, it will consume too much time in parsing the data type, also resulting in lower performance.
Does it work if we develop one storage ourselves, and write data as binary format? It's not that simple. Although this method will solve the problem of data type parsing, it may occupy more space than text without a well-designed encode. The parsing time is reduced, but the hard disk read and write time is increased, so performance improvement may be very limited. In short, the development of an efficient storage needs to consider any mean that can improve efficiency, such as encode and compression, columnar storage and even index, which is not easy to implement.
If the data source itself is text file or database, there is nothing we can do except bear low IO speed. However, many complex operations (such as big data sorting) need to store intermediate results during computation; theoretically the read and write performance are controllable, yet due to lack of efficient storage scheme, we have to use inefficient text or database, resulting in low overall performance. In addition, some operations need to utilize a large amount of historical data, if such data are all read from text or database, an embarrassing situation where the IO time far exceeds computing time often occurs.
esProc SPL (Structured Process Language), an open-source programming language specifically designed for structured data computing, can handle batch jobs in place of stored procedures, and can effectively solve various problems that Java/Python encounters in batch jobs.
SPL has a simple and easy-to-use development environment that offers rich editing and debugging functionalities, making it more convenient to code than Java and Python; SPL runs outside the database, which means that the change of database type does not involve modifying SPL scripts, and that the migration issue of stored procedure is solved; SPL provides cursor computing method for big data processing, making it possible to easily process the data that exceeds memory capacity, and eliminating the need to store intermediate data, and achieving higher performance; SPL can compute in parallel on the cursor to further improve computing performance; SPL specifically designs the efficient binary file storage for achieving higher computing performance. When this storage works with high-performance algorithms built in SPL, the computing performance can be fully guaranteed.
In order to cope with big data computing, SPL specially provides the cursor mechanism to compute on external storage, in addition to computing in memory. And, the code for external storage computing is almost the same as that for in-memory computing, for example:
A | |
1 | D:\data\orders.txt |
2 | =file(A1).cursor@t() |
3 | =A2.groups(area;sum(amount):amount) |
A2 creates a file (text) cursor, so that the grouping and aggregating in A3 will not load all data into memory in one go, but calculates in batches, hereby achieving the purpose of grouping and aggregating big data. Moreover, except for the difference between the f.cursor() for creating an external storage cursor and the f.import() for in-memory computing, other steps are exactly the same, which effectively lowers the threshold to use SPL.
However, there are still some differences between external storage cursor computing and in-memory computing. Specifically, memory supports random and small-amount data access, while external storage (hard disk) only supports sequential and batch data access, and the cursor has to traverse data sequentially, and becomes invalid after traversing the data once. If we want to perform other calculations in external storage, we need to create cursor again, which is equivalent to traversing data multiple times. In batch jobs, however, multiple steps of operations are needed for the same data, and if the data needs to be traversed in each operation, the cost will be much higher. To solve this problem, SPL provides delayed cursor mechanism, which only binds the calculations on the cursor and will not calculate until fetching action (result set function) occurs. In this way, traversing the data only once can accomplish multiple steps of operations. For example, the following code is to filter, group and aggregate the data:
A | |
1 | D:\data\orders.txt |
2 | =file(A1).cursor@t() |
3 | =A2.select(amount>50) |
4 | =A3.groups(area;sum(amount):amount) |
Here, A3 is a delayed cursor. When executing to A3, just record the calculation, and filtering, grouping and aggregating operations will be performed when the data is fetched in A4.
It’s not over yet. What if the big data computing is not a multi-step calculation but multiple same-type calculations? For example, if we want to group and aggregate the amount by area as well as by product, which means that there will be multiple data fetching actions (result set function) at the same time, then the delayed cursor mechanism will not work. In this case, do we have to create another cursor to traverse again?
No, SPL provides the multipurpose traversal mechanism, enabling us to accomplish multiple types of calculations in one traversal by creating a synchronous channel on the cursor. For example:
A | B | |
1 | =file("orders.txt").cursor@t(product,area,amount) | |
2 | cursor A1 | =A2.groups(area;max(amount)) |
3 | cursor | =A3.groups(product;sum(amount)) |
4 | cursor | =A4.select(amount>=50).total(count(1)) |
Having created the cursor, use the cursor statement to create a channel on it, and attach operations on the channel. We can create multiple channels, if the cursor parameter is not written in the subsequent statements, it indicates the same cursor will be used.
With the help of the mechanisms such as cursor, delayed cursor, and multipurpose traversal (channel), SPL can easily handle big data computing. Currently, SPL provides a variety of external storage computing functions and can meet almost all big data computing requirements, which makes SPL far superior to Java and Python.
For big data computing, the performance is critical. We know that parallel computing can effectively improve computing efficiency. Besides the delayed cursor and multipurpose traversal that can guarantee the performance to a certain extent, SPL provides a multi-thread parallel processing mechanism to speed up calculation. Likewise, this mechanism is easy to use.
A | B | |
1 | =file("orders.txt") | |
2 | fork to(4) | =A1.cursor@t(area,amount;A2:4) |
3 | return B2.groups(area;sum(amount):amount) | |
4 | =A2.conj().groups(area;sum(amount)) |
The fork statement will start multiple threads to execute their own code blocks in parallel, and the number of threads is determined by the parameter following the fork. Moreover, the fork statement will assign these parameters to each thread in turn. When all threads are executed, the calculation result of each thread will be collected and concatenated for further operation.
Compared with Java and Python, SPL is much more convenient while using fork to start multiple threads for parallel computing. However, such SPL code is still a bit cumbersome, especially for counting the data of common single table and, attention should also be given that it may need to change the function (from count to sum) when re-aggregating the results returned from threads. To solve this problem, SPL provides a simpler syntax: multi-cursor, which can directly generate parallel cursors.
A | |
1 | =file("orders.txt") |
2 | =A1.cursor@tm(area,amount;4) |
3 | =A2.groups(area;sum(amount):amount) |
Using the @m option can create parallel multi-cursor, and the subsequent usage is the same as that of single cursor, and SPL will automatically execute parallel computing and re-aggregate the results.
Multipurpose traversal can also be implemented on multi-cursor.
A | B | |
1 | =file("orders.txt").cursor@tm(area,amount;4) | |
2 | cursor A1 | =A2.groups(area;sum(amount):amount) |
3 | cursor | =A3.groups(product;sum(amount):amount) |
By means of simple and easy-to-use parallel computing, the potential of multiple CPUs can be given into full play to guarantee the computing performance and reduce the time for batch jobs.
As discussed above, the database-based big data computing is very slow (due to high IO cost). Although the text-based big data computing is a little bit better, and the computing efficiency can be improved to a certain extent by utilizing the cursor and parallel computing technologies mentioned above, the efficiency to parse text is still very low, and thus an own high-performance storage is needed to further guarantee computing performance.
For this reason, SPL designs its own binary file storage, which is an efficient file format that integrates multiple mechanisms such as encoding, compression, columnar storage, index, and segmentation.
Currently, SPL provides two high-performance file types: bin file and composite table. The bin file adopts the compression technology (faster reading due to less space occupation), and stores the data types (faster reading as a result of avoiding parsing data type). Since the bin file supports the double increment segmentation mechanism that can append data, it is easy to implement parallel computing by utilizing the segmentation strategy, and the computing performance is ensured. The composite table supports the columnar storage, which has great advantage when the number of columns (fields) involved in calculation is small. In addition, the composite table not only supports the double increment segmentation mechanism, but adopts the index technology, allowing us to utilize the advantage of columnar storage, and improve the performance more easily through parallel computing.
In batch job business, a large amount of historical cold data is often involved. With high-performance storage, we can move these data from database to file to achieve higher computing performance. Even if the data cannot be moved out of database sometimes, it makes sense to use SPL’s high-performance storage to write buffer files or intermediate results and, the historical cold data can often be copied to file storage for efficient computation.
With the guarantee of high-performance storage, computing efficiency can be further improved. Moreover, SPL can implement some other performance-improving mechanisms based on its own storage, such as pre-cursor filtering technology.
Normally, filtering in external storage needs to read out one record first, and then judge whether this record meets the criteria and discard the records that fail to meet criteria. Such simple method will lead to some useless actions since the whole record needs to be read before judging. It would be more efficient if judgment was made ahead of record reading. For this reason, SPL provides the pre-cursor filtering mechanism, allowing us to attach a filter condition when creating a cursor. In this way, SPL will first read only the field values for calculating criteria. If the criteria are not met, this record will be discarded and go to next record; only when the criteria are met, the other required fields would be read out to create the record.
A | |
1 | =file("persons.ctx").open() |
2 | =A1.cursor(sex,age;age>=18).groups(sex;avg(age)) |
Similar to text file, SPL’s storage (bin file and composite table) also supports reading data in segments and computing in parallel, and it is also easy to use:
A | |
1 | =file("orders.ctx").open().cursor@m(;;4) |
2 | =A1.groups(area;sum(amount):amount) |
SPL’s storage also adopts a specially designed double increment segmentation technology to segment data. Furthermore, SPL’s composite table supports synchronous segmentation. When two associated tables are both large, the join operation of the tables can be performed in parallel through synchronous segmentation to improve computing performance.
Based on high-performance storage, SPL also provides many high-performance technologies, such as index mechanism, one-side partitioning algorithm, ordered merge algorithm and attached table mechanism. Backed by high-performance storage, the performance of SPL in computing big data is often several times to dozens of times faster than that of general databases, and more advantageous than Java and Python.
What's more, SPL runs independently of database, which makes it possible to has the framework advantages that Java/Python brings and, SPL has good integration and openness. Integration enables SPL to be embedded in application, and the openness allows SPL to access diverse data sources and perform mixed computing. Modern batch jobs will involve multiple data sources sometimes, and the good openness of SPL makes it possible to perform mixed computing directly based on such data sources without having to store data in advance (such as loading into database), and thus it is more convenient.
Using SPL for batch jobs has been widely applied in many enterprises, and achieved very significant effect. The following are part of actual cases:
Open-source SPL speeds up batch operating of bank loan agreements by 10+ times
Open-source SPL optimizes batch operating of insurance company from 2 hours to 17 minutes
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code