-
Notifications
You must be signed in to change notification settings - Fork 339
SQL Performance Enhancement:DISTINCT & COUNT(DISTINCT) on Big Data
The essay analyzes the fundamental reason behind slow SQL statements and provides the solution to make improvements through code samples. Looking ${article} for details.
Getting distinct values is, in essence, a type of grouping operation that needs a traversal of source data. A distinct result set needs to be kept during the whole computing process. Each source record will be searched for in the result set to see if there is already the same one, and based on the searching result, it will be discarded or appended to the result set.
If the result set is relatively small, the speed of memory search is acceptable. If the result set cannot fit into the memory, the HASH partitioning algorithm will be used to divide the source data and buffer each part separately into the external memory. Then we perform DISTINCT on each in-memory result set. There will be external memory read & write in the process and that results in bad performance.
The above-mentioned distinct result set is also needed to do the pure count operation COUNT(DISTINCT), and thus same issues exist.
The whole process will become simple if the source data table is ordered by the field on which DISTINCT operation will be performed. We just need to save each field value that is different from its previous one. This does not need any partitioning operation, even if the result set cannot fit into the memory. For COUNT(DISTINCT), we only need to add 1 when the current value of the target field is different from the previous one. Even a result set is not needed.
Usually, the field on which DISTINCT (or COUNT DISTINCT ) is performed is one with a limited number of definite values, such as User Account. It is seldom that any field is specified to do DISTINCT ( or COUNT DISTINCT) operation. Data will be first sorted to accelerate the computation.
Creating index on a field is, to some degree, equivalent to sorting, making the computation fast. Generally, the DISTINCT operation is preceded and followed by other computations, such as filtering before DISTINCT, which may use any field in the source data table. If the filtering field isn’t covered by the index, the index cannot be used to perform DISTINCT. Even creating an index on the to-be-DISTINCTed field gives not enough help to get the operation well done.
Often the newly-added data isn’t ordered by the to-be-DISTINCTed field. It should not be directly appended at the end of the already ordered data. For instance, for the source data ordered by user account, the newly-added data could contain any user account. If we append it directly at the end the ordered source data, the result won’t be ordered by the user account anymore. And it will be time-consuming if we combine the new data with the existing data and do a full sorting.
Instead, we can sort the newly-added data first and generate a new ordered table based on it and the old ordered data using the cheap order-based merge algorithm. Its overall degree of complexity amounts to reading and writing the whole data once. The method can avoid the frequent temporary external buffering that happens in the ordinary full sorting and thus get higher performance.
Further, we can keep a small-scale ordered table (referred to as patch data below), which will merge with the sorted newly-added data while keeping the old ordered data unchanged. Over time when the patch data gets accumulated, it will be merged with the old ordered data. DISTINCT operation retrieves both the old ordered data and the patch data, merges and traverses them. The performance decreases a little compared with handling one table of ordered data, but the data orderliness can still make the computation much faster.
The time when to merge the patch data with the old ordered data is related to the cycles when data should be added. If data is added each day, we can do the merge each month. The patch data stores data in one month at most and the existing ordered data contains all data one month ago. The patch data could be far smaller than the old ordered data, so the daily size of to-be-merged data is small and the data appending is fast. As we do the whole-data order-based merge once a month, it is acceptable that it takes a little longer to complete.
Below are SQL statements to achieve DISTINCT and COUNT(DISTINCT):
SQL1: select distinct F1,…,Fn from T where …
SQL2: select count(distinct F1,…,Fn) from T where …
Step 1: Data pre-processing and order-based storage.
A | |
---|---|
1 | =file("T-original.ctx") |
2 | =A1.open().cursor().sortx('F1',…,'Fn').new('F1',…,'Fn',…) |
3 | =file("T.ctx").create('#F1',…,'#Fn',…) |
4 | =A3.append@i(A2) |
A1: The original composite table T-original.ctx before conversion.
A2: Open A1’s composite table, create cursor on it, sort the cursor by the to-be-DISTINCTed field, and return a result cursor where the DISTINCTed field is put ahead. Since F1…Fn are namesake with cell names, they are enclosed with single quotes. Other field names do not need to be single quoted.
A3: Create a new composite table *T.ctx *where certain field names are preceded with the pound sign #. That means the table is ordered by these fields (in their order and they need to be among the first several fields).
A4: Append the sorted data to T.ctx.
Step 2: DISTINCT operation in SQL1.
A | |
---|---|
1 | =file("T.ctx").open().cursor('F1',…,'Fn';…) |
2 | =A1.group@1('F1','F2','F3') |
3 | … |
A1: Open composite table T.ctx and get the to-be-DISTINCTed fields to generate a cursor. The filtering condition should be written as one of the cursor’s parameters and placed after the semicolon, such as cursor('F1',…,'Fn';Fx==1 && Fy<10 && Fz>200), which is equivalent to the SQL clause where Fx=1 and Fy<10 and Fz>200.
A2: Define an order-based grouping on A1’s cursor. The function will get the first record from each group and return all the first records as a cursor (@1 option uses number 1 instead of the lowercase letter l).
A3: If the result set is too large to be wholly retrieved at once, we can further handle the result cursor and just save the result set.
Step 3: Count(DISTINCT) operation in SQL2.
A | |
---|---|
1 | =file("T.ctx").open().cursor('F1',…,'Fn';…) |
2 | =A1.group@1('F1','F2','F3') |
3 | =A2.skip() |
A3: Perform count on A2’s cursor.
Step 4: Data appending.
Suppose the newly-added data each day is stored in T_new.btx that has the same fields of the same order as T.ctx:
A | B | |
---|---|---|
1 | =file("T.ctx").open() | |
2 | =file("T-new.btx").cursor@b().sortx('F1',…,'Fn') | |
3 | if (day(now())==1) | >A1.reset() |
4 | >A1.append@a(A2) |
A1: Open composite table T.ctx.
A2: Define cursor on T_new.btx and sort it. Usually the volume of daily newly-added data is small, so in-memory sorting is actually performed though sortx function is used and the computation is fast.
A3: Check whether the current date is day 1, and execute A4, if it isn’t, to merge the record with the patch data using append@a; or execute B3, if it is, to merge the main data with the patch data using reset function.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code