-
Notifications
You must be signed in to change notification settings - Fork 339
How does the in memory database bring memory’s advantage into play?
The data access speed of the in-memory database can be several orders of magnitude higher than that of the ordinary database that mainly relies on the disk to store data, and hence the in-memory database can greatly improve computing performance, and is more suitable for high-concurrency and low-latency business scenarios.
However, most of the current in-memory databases still use the SQL model, and SQL lacks some necessary data types and operations, and cannot make full use of the features of memory to implement some high-performance algorithms. Simply moving the data and operations from external storage into memory can indeed achieve much better performance, but it does not make full use of the features of memory, therefore, an extreme performance cannot be obtained.
Now let's see what algorithms and storage mechanisms are suitable for the features of memory, and can further improve the computing speed of in-memory database.
We know that the memory can be accessed through address (pointer). However, SQL does not have the data objects represented by in-memory pointer, and it usually needs to copy the data to form a new data table when returning the result set. This will not only consume more CPU time (for copying data), but occupy more expensive memory space (for storing the copied data), resulting in a decrease in memory usage rate.
In addition to SQL-model in-memory databases, the same problem exists in the RDD in Spark, and the situation is more serious. In order to keep the immutable characteristics of RDD, Spark will copy a new RDD after each calculation step, it will cause a lot of waste in memory and CPU. Therefore, even if huge resources are consumed, Spark still cannot achieve high performance. In contrast, the SQL-model in-memory databases are usually optimized, and calculation in the SQL statement will use the in-memory addresses as much as possible, so such databases usually perform better than Spark.
However, due to theoretical limitations, when implementing the logic of SQL, the returned result set must be copied. If a multi-step procedural operation is involved, further calculation is required many times based on the result set (temporary table) of the previous step, this will make the disadvantage of SQL very obvious.
In fact, if the data structure is not changed, we can directly use the addresses of the original data to form a result set, without the need to copy the data itself. In this way, we only need to save one more address (pointer), thus reducing CPU and memory consumption.
SPL extends the data types of SQL and supports the pointer-style reuse mechanism. For example, after filtering the order table by the range of the order date (odate), we want to find out the orders with order amount (amount1) greater than 1000 and the orders with each shipping fee (amount2) greater than 1000 respectively, and then calculate the intersection, union and difference of the two kinds of orders, and finally sort the difference by the customer number (cid). SPL code is roughly as follows:
A | B | |
1 | =orders.select(odate>=date(2000,1,1) && odate<=date(2022,1,1)) | |
2 | =A1.select(amount1>1000) | =A1.select(amount2>1000) |
3 | =A2^B2 | =A2&B2 |
4 | =A2\B2 | =B2\A2 |
5 | =A4.sort(cid) | =B4.sort(cid) |
In this code, although there are multiple steps, and some intermediate results are also used multiple times, since what are used are the pointer of the records in the order table, the increased memory occupation is very limited, and the time consumed in copying the records is avoided.
Foreign key association refers to using the non-primary key fields of one table (fact table) to associate the primary key of another (dimension table). For example, use the customer number and product number in the order table to associate the primary keys of the customer table and product table, respectively. In practice, such association may involve as many as seven/eight or even a dozen tables, and there may also be multiple-layer association. Usually, SQL database uses the HASH JOIN algorithm to perform in-memory join, and this algorithm needs to calculate and compare the HASH values, and in this process, the memory will be occupied to store intermediate results. When there are many associated tables, the computing performance will drop sharply.
In fact, we can also use the in-memory pointer reference mechanism to pre-associate, that is, convert the value of the association fields in the fact table to the pointer of record in the corresponding dimension table when the system is initialized. Since the association field of dimension table is the primary key, the associated record is unique, and converting the foreign key value to the pointer of record will not cause an error. In subsequent calculations, when it needs to reference the dimension table field, we can use the pointer to reference directly, without the need to calculate and compare the HASH values, and the need to store the intermediate results, hereby obtaining better performance. Because SQL has no the data type “record pointer”, the pre-association cannot be achieved.
SPL supports and implements this pre-association mechanism in principle. For example, the code for pre-associating the order table with customer table and product table is roughly as follows:
A | |
1 | =file("customer.btx").import@b().keys@i(cid) |
2 | =file("product.btx").import@b().keys@i(pid) |
3 | =file("orders.btx").import@b().switch(cid,A1;pid,A2) |
4 | >env(orders,A3) |
A1 and A2 load the customer table and product table respectively.
A3: Load the order table, and convert the customer number cid and product number pid to the pointer of records in the corresponding dimension table.
A4: Store the pre-associated order table into the global variables for subsequent calculations.
When the system is running, filter the orders by product supplier, and then group and aggregate by the city where the customer is located. The code is roughly as follows:
A | |
1 | =orders.select(pid.supplier=="raqsoft.com").groups(cid.city;sum(pid.price*quantity)) |
Since the pid in the order table has been converted to the pointer of record in product table, the operator “.” can be used directly to reference the records in the product table. In this way, not only is the writing easier, but the computing performance is much faster.
When only two or three tables are associated, the difference between pre-association and HASH JOIN is not very obvious, for the reason that the association is not the final goal, and there will be many other operations after association, and the proportion of time consumed at association operation itself is relatively small. On the contrary, if the association situation is more complex, involving many tables, or involving multiple layers (for example, the order is associated with the product, the product is associated with the supplier, the supplier is associated with the city, and the city is associated with the country and so on), the performance advantage of pre-association will be more obvious.
Compared with external storage, another important feature of memory is that it supports high-speed random access, which can quickly fetch the data from the in-memory table by the specified sequence number (i.e., position). When performing the search computing, if the to-be-searched value is exactly the sequence number of the target value in the in-memory table, or it is easy to calculate the sequence number of target value through the to-be-searched value, we can use the sequence number to take the target record directly. This method can directly take the search result, without the need to perform any comparison, and its performance is not only much better than traversal method, but better than the search algorithm using index.
Unfortunately, SQL is based on the unordered set, and cannot take the member by sequence number, and instead, it can only search by sequence number. If there is no index, searching can only be done by traversal, which will be very slow. Even if there is an index, it needs to calculate the HASH value or uses the binary search method, and the speed is not as fast as direct positioning. Moreover, creating an index will also occupy a lot of expensive memory. If there is no sequence number in the data table, it needs to sort first and then create the sequence number, the performance will be worse.
SPL is based on the ordered set, and provides the sequence number positioning function. For example, there is an order table, and the order numbers in it are the natural number starting from 1. When searching for the order number i, we just need to take the i-th record in the order table. For another example, there is a data table T that stores one piece of data per day from the year 2000 to 2022, and now we want to query the record of a specified date. Although the date is not the sequence number of the target value, we can first calculate the number of days from the starting date to the specified date, this number is the sequence number of target value. After that, we only need to take the record in table T with the calculated sequence number. The code for searching for the record on April 20, 2022 in table T by using the sequence number positioning method is roughly like this:
A | |
1 | =date(2022,12,31)-date(1999,12,31) |
2 | =T_orginal.align@b(to(A1),dt-date(1999,12,31)) |
3 | =env(T,A5) |
4 | =T(date(2021,4,20)-date(1999,12,31)) |
A1: Calculate the total number of days from the year 2000 to 2022, which is 8401 days.
A2: Use the records of original table T to calculate the number of days from the starting date, and then align it with the natural number set to(A1), i.e., [1,2,3,…,8401]. The vacant date will be filled with null, and the option @b of align means that the binary search will be used to search for the position during alignment, this will make the alignment action faster.
A3: Put the calculated result into the global variable T.
A4: To search for the record of April 20, 2021, first find the number of days (7781 days) between this date and starting date, and then directly take the 7781st record in table T.
A1 to A3 are alignment calculations, which are used to process the vacancy date, and can be performed in the system initialization phase. During the search calculation, the searched result can be obtained by using the sequence number positioning code in A4, and the actual searched date can be passed in as a parameter.
When the amount of data is so large that the memory of one machine cannot hold, it needs to use the cluster to load the data. Many in-memory databases also support the distributed computing, which usually works in a way that divides data into multiple segments first, and then loads them into the memory of different nodes of cluster respectively.
JOIN is a troublesome task in distributed computing as it will involve data transmission between multiple nodes. In serious cases, the delay caused by transmission will offset the benefits obtained from sharing the calculation amount by cluster, and a phenomenon that the cluster becomes larger but the performance cannot be improved may occur.
The distributed database under the SQL system is usually to extend the HASH JOIN method for single machine to the cluster. Specifically, each node will distribute its own data to other nodes according to the HASH value to ensure that the associated data is on the same node, and then perform the single machine join operation on each node. However, this method may cause serious imbalance in data distribution when it is unlucky. In this case, it needs to use external storage to buffer the distributed data, otherwise the system may crash due to out-of-memory. But, for all we know, the main feature of in-memory database is simply to load data into memory for calculation, and the computing performance will be seriously slowed down once the external storage is used to buffer the data.
In fact, there is a big difference between the fact table and the dimension table that are associated through foreign key. The fact table is generally relatively large, which needs to be loaded in segments into the memory of every node. Luckily, the fact table is also more suitable for segmentation, and the data of each segment is independent of each other, and the nodes do not need to access each other. On the contrary, the records of dimension table will be accessed randomly, and any segment of fact table may associate all dimension table records. Therefore, we can make use of the difference between fact table and dimension table to speed up foreign key association of cluster.
If the dimension table is relatively small, load all data of dimension table to the memory of each node. In this way, we can continue to pre-associate the fact table segments and full dimension table in each node, completely avoiding network transmission during the association process.
If the dimension table is also so large that the memory of one machine cannot hold, it has to be loaded in segments into the memory of all nodes. In this case, no any node holds a full dimension table, and hence it is unavoidable to generate network transmission during the foreign key association calculation. The transmission content, however, is not very large, only involving the fields of the associated record between the fact table foreign key and the dimension table, and other fields of fact table do not need to be transmitted. The calculation can be achieved directly, and no buffer data is generated in this process.
SPL distinguishes the dimension table from fact table in principle, and provides the dimension table duplicate mechanism and the segmented dimension table mechanism respectively for small dimension table and large dimension table, and implements the above algorithm, which can significantly improve the calculation performance of foreign key associations in the case of cluster.
When it comes to cluster system, the fault tolerance must be considered, and the fault tolerance of in-memory data differs from that of external storage. External storage generally uses the method of copying the data, that is, the same data has multiple copies, and can still be found on other nodes once a certain node fails. The storage utilization rate of this mechanism is very low, only 1/K (k is the number of copies).
For the data in memory, however, such copy-pattern fault tolerance method does not work, for the reason that the hard disk is so cheap that its capacity can be expanded almost infinitely, while the memory is much more expensive and there is an upper limit on its capacity expansion. Therefore, a utilization rate of only 1/k is unacceptable for memory.
The fault tolerance for memory requires specialized means different from external storage. SPL provides the spare-wheel-pattern fault tolerance mechanism, which divides the data into n segments and loads them into the memory of n nodes respectively. At the same time, k idle nodes are prepared acting as spare node. In this way, when one running node fails, a certain spare node will be started immediately to instantly load the data of the failed node, and reconstitute a cluster with complete data together with other nodes to continue to provide services. After troubleshooting, the failed node will return to normal state, and can be used as a spare node. The whole process is very similar to the mode of replacing the spare wheel of a car.
The memory utilization rate of this mechanism can reach up to n/(n+k), which is much higher than 1/k of copy-pattern fault tolerance mechanism. In this mechanism, the amount of data to be loaded into memory is usually not very large, and the instant loading time is not much when the node fails, and hence the cluster service can be restored quickly.
For the computing system of in-memory database, only by making full use of the features of memory can we obtain an extreme performance. From the perspective of data computing, the main advantages of memory include: it supports the pointer reference and high-speed random access, and its concurrent reading ability is powerful, while its disadvantages are: high cost and limited capacity expansion.
Unfortunately, the SQL computing system lacks some necessary data types and operations, for example, it lacks the data type “record pointer”; it does not support the ordered operation; it defines the JOIN too general; it does not distinguish JOIN types, and hence the above-mentioned features of memory cannot be fully used to implement some high-performance algorithms in principle. Usually, the SQL-based in-memory database is just to copy the data and operations of external storage, which will cause various problems. For example: the record-based copying consumes too much CPU and memory; the searching and JOIN performances don't reach an extreme. On the other hand, for the cluster: the memory utilization rate is too low; A large number of network transmission amount leads to an increase in the number of nodes but a decrease in performance; it needs the external storage to buffer the data in the case of multi-machine JOIN.
SPL, an open-source data computing engine, expands the data types and operation definitions, which can make full use of the features of memory to implement a variety of high-performance algorithms, and make the performance extreme. Specifically, the pointer-style reuse mechanism utilizes the reference mechanism peculiar to memory, which not only saves memory space but makes the speed faster. The pre-association method also uses the pointer reference mechanism to achieve the time-consuming foreign key association in the initialization stage, and the associated result can be directly used in the subsequent calculations, and thus the calculation speed is significantly improved. The sequence number positioning algorithm uses the ordered characteristics and gives full play to the advantages of high-speed random access of memory. Instead of performing any calculation and comparison, this algorithm can read the records directly through sequence number, and hence its performance is better than that of search algorithms such as HASH index. The cluster dimension table effectively avoids or reduces network transmission, and avoids external storage from buffering the data. The spare-wheel-pattern fault tolerance mechanism effectively improves the memory utilization rate in the case of cluster under the premise of ensuring high availability.
In addition, SPL also provides other methods such as serial byte, sequence number index, data type compression, etc. Programmers can use these methods in a targeted manner according to specific scenarios. In this way, the advantages of memory can be fully brought into play, hereby effectively improving the performance of in-memory data computing.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code