-
Notifications
You must be signed in to change notification settings - Fork 339
Performance optimization skill:Association For Dimension Table Filtering & Computation
Often a join query between the fact table and a dimension table involves filtering and computation over the dimension data, which can be optimized in two approaches:
1. First join the fact table and the dimension table (pre-associate them if they can fit into the memory) and then filter the associated fact table just like the calculations in Performance Optimization Skill: Pre-Association and Performance Optimization Skill: Numberizing Foreign Key.
2. First perform filtering on the dimension table and then join the filtered table with the fact table. A join requires an index on the dimension table’s key. But after filtering the index cannot be used anymore and a new one needs to be created.
Their effects are influenced by the proportion of the size of dimension table(s) to that of the fact table. So we need some tests to find which one is better in different contexts.
Eight data tables, 50G data in total, have been generated according to TPCH standard. The structure of TPCH data file is vastly introduced online, which will not be elaborated here.
The server for test has two Intel2670 CPUs, 2.6G frequency, 16-core in total, 128G memory and an SSD hard disk.
The following tests are calculated in single-thread without the help of multi-core to make the difference clearer.
All-in-memory processing means to pre-load all the source data in memory. The data to be computed here are customer, the dimension table with 7.5 million records in total, and orders, the fact table with 75 million records.
The filtering conditions on the dimension table are “left(C_NAME,4)!="shen" && C_NATIONKEY>-1 && C_ACCTBAL>bal”. The goal is to calculate the total orders amount meeting these conditions. The first two conditions are always true (to increase the computation amount for the purpose of emphasizing the differences in tests). Parameter bal is used to test the query speeds with different proportions of dimension data size and fact data size.
The SPL script with pre-joined dimension table and fact table:
A | |
---|---|
1 | >customer=file("/home/ctx/customer.ctx").open().memory().keys@i(C_CUSTKEY) |
2 | >orders=file("/home/ctx/orders.ctx").open().memory() |
3 | =orders.switch(O_CUSTKEY,customer) |
4 | =now() |
5 | =orders.select(left(O_CUSTKEY.C_NAME,4)!="shen" && O_CUSTKEY.C_NATIONKEY>-1 && O_CUSTKEY.C_ACCTBAL>bal) |
6 | =A5.sum(O_TOTALPRICE) |
7 | =interval@s(A4,now()) |
A1 loads the dimension table in memory and creates an index on its keys. A2 loads the fact table in memory. A3 pre-associates them. But these previous steps are not counted in test time which only starts from the execution of A4.
SPL script:
A | |
---|---|
1 | >customer=file("/home/ctx/customer.ctx").open().memory().keys@i(C_CUSTKEY) |
2 | >orders=file("/home/ctx/orders.ctx").open().memory() |
3 | =now() |
4 | =customer.select(left(C_NAME,4)!="shen" && C_NATIONKEY>-1 && C_ACCTBAL>bal).derive@o().keys@i(C_CUSTKEY) |
5 | =orders.switch@i(O_CUSTKEY,A4) |
6 | =A5.sum(O_TOTALPRICE) |
7 | =interval@s(A3,now()) |
A4 filters customer table and then recreates an index on it. A5 joins the two tables.
SPL supports the reuse of an existing index. So we just need to change A4 in the above script into:
=customer.select@i(left(C_NAME,4)!="shen" && C_NATIONKEY>-1 && C_ACCTBAL>bal)
Add @i option to select in order to reuse customer table’s index.
First load the numberized composite tables customer_xh.ctx and orders_xh.ctx when pre-loading data tables.
SPL script:
A | |
---|---|
1 | >customer=file("/home/ctx/customer_xh.ctx").open().memory() |
2 | >orders=file("/home/ctx/orders_xh.ctx").open().memory() |
3 | =now() |
4 | =orders.switch@i(O_CUSTKEY,customer:#) |
5 | =A4.select(left(O_CUSTKEY.C_NAME,4)!="shen" && O_CUSTKEY.C_NATIONKEY>-1 && O_CUSTKEY.C_ACCTBAL>bal) |
6 | =A5.sum(O_TOTALPRICE) |
7 | =interval@s(A3,now()) |
A numberization-based association doesn’t need an index, so we don’t create index in A1. customer:# in A4 is used to associate O_CUSTKEY values with the row numbers of customer table.
First load the numberized composite tables customer_xh.ctx and orders_xh.ctx when pre-loading data tables.
SPL script:
A | |
---|---|
1 | >customer=file("/home/ctx/customer_xh.ctx").open().memory() |
2 | >orders=file("/home/ctx/orders_xh.ctx").open().memory() |
3 | =now() |
4 | =customer.(left(C_NAME,4)!="shen" && C_NATIONKEY>-1 && C_ACCTBAL>bal) |
5 | =orders.select(A4(O_CUSTKEY)) |
6 | =A5.sum(O_TOTALPRICE) |
7 | =interval@s(A3,now()) |
In A4, use customer. (filtering condition) to calculate a sequence of the same length as the number of records with true or false as its values, which is called alignment base sequence. Since the numberized O_CUSTKEY field values in orders table corresponds to row numbers in the customer table in order, we can use A4(O_CUSTKEY) in A5 to check if the row in orders meets the filtering condition.
Here are results of the above test scripts (Unit: second):
Number of records of filtered dimension table | 7.16 million | 6.13 million | 4.77 million | 2.73 million | 0.68 million |
---|---|---|---|---|---|
Pre-association | 41 | 39 | 38 | 37 | 35 |
Index recreation | 39 | 34 | 29 | 25 | 19 |
Index reuse | 35 | 31 | 27 | 23 | 17 |
Numberized foreign key | 53 | 51 | 49 | 48 | 46 |
Alignment base sequence building | 25 | 23 | 21 | 19 | 16 |
In these tests, the records in fact table (75 million) is ten times as many as those in dimension table (7.5 million).
Both pre-association and numberized foreign key tests do the association first and then the filtering. The complex filtering operation is performed over the fact table, which means that the computation amount is 10 times more than that of filtering the dimension table directly. So it takes the longest to execute the whole query. But the pre-association is faster than the numberized foreign key because it doesn’t need to do the association any more during the query.
Both the index recreation and index reuse texts do the filtering over the dimension table first and then joining with the fact table. The complex filtering operation is executed on the rows of dimension table, so these two tests are faster than the above two mentioned methods. Though index recreation and index reuse require the same computation amount of filtering, association and sum, the latter is faster because it doesn’t need to create an index. However, as the data size after dimension table filtering becomes smaller and small, it will take less time to recreate an index, resulting in narrower execution time differences.
Alignment base sequence building test performs filtering first on the rows of dimension table and then on the fact table according to the built alignment base sequence without joining with the fact table, the index creation and hash values, therefore, it is the fastest among all the optimization skills.
This time we make the 75-million-records orders as the dimension table and the 3-billion-record lineitem as the fact table.
The filtering conditions on the dimension table are “left(O_ORDERPRIORITY,2)!="9-" && O_ORDERSTATUS!="A" && O_ORDERDATE>date("1990-01-01") && O_TOTALPRICE>price”. The goal is to calculate the total orders amount meeting these conditions. The first three conditions are always true (to increase the computation amount for the purpose of emphasizing the differences in tests). Parameter price is used to test the query speeds with different proportions of dimension data size and fact data size.
The SPL script:
A | |
---|---|
1 | >orders=file("/home/ctx/orders.ctx").open().memory().keys@i(O_ORDERKEY) |
2 | =now() |
3 | =file("/home/ctx/lineitem.ctx").open().cursor(L_ORDERKEY,L_EXTENDEDPRICE) |
4 | =A3.switch@i(L_ORDERKEY,orders) |
5 | =A4.select(left(L_ORDERKEY.O_ORDERPRIORITY,2)!="9-" && L_ORDERKEY.O_ORDERSTATUS!="A" && L_ORDERKEY.O_ORDERDATE>date("1990-01-01") && L_ORDERKEY.O_TOTALPRICE>price) |
6 | =A5.total(sum(L_EXTENDEDPRICE)) |
7 | =interval@s(A2,now()) |
A1 loads the dimension table in memory and creates an index on its key, which is not counted in the test time. The execution time starts from A2. Since the fact table size is huge, we retrieve data from it with a cursor and join it with the dimension table before performing the filtering.
SPL script:
A | |
---|---|
1 | >orders=file("/home/ctx/orders.ctx").open().memory().keys@i(O_ORDERKEY) |
2 | =now() |
3 | =orders.select(left(O_ORDERPRIORITY,2)!="9-" && O_ORDERSTATUS!="A" && O_ORDERDATE>date("1990-01-01") && O_TOTALPRICE>price).derive@o().keys@i(O_ORDERKEY) |
4 | =file("/home/ctx/lineitem.ctx").open().cursor(L_ORDERKEY,L_EXTENDEDPRICE).switch@i(L_ORDERKEY,A3) |
5 | =A4.total(sum(L_EXTENDEDPRICE)) |
6 | =interval@s(A2,now()) |
The orders table are filtered and then an index is created on it in A3.
We just change the code of A3 in the above script into:
=orders.select@i(left(O_ORDERPRIORITY,2)!="9-" && O_ORDERSTATUS!="A" && O_ORDERDATE>date("1990-01-01") && O_TOTALPRICE>price)
The @i option is added to select in order to reuse orders table’s index.
First load the numberized orders table, which is the composite table orders_xh.ctx, without creating index on it.
SPL script:
A | |
---|---|
1 | >orders=file("/home/ctx/orders_xh.ctx").open().memory() |
2 | =now() |
3 | =file("/home/ctx/lineitem_xh.ctx").open().cursor(L_ORDERKEY,L_EXTENDEDPRICE) |
4 | =A3.switch@i(L_ORDERKEY,orders:#) |
5 | =A4.select(left(L_ORDERKEY.O_ORDERPRIORITY,2)!="9-" && L_ORDERKEY.O_ORDERSTATUS!="A" && L_ORDERKEY.O_ORDERDATE>date("1990-01-01") && L_ORDERKEY.O_TOTALPRICE>price) |
6 | =A5.total(sum(L_EXTENDEDPRICE)) |
7 | =interval@s(A2,now()) |
orders:# in A4 is used to associate L_ORDERKEY values with the row numbers of orders table.
First load the numberized orders table, which is the composite table orders_xh.ctx, without creating index on it.
SPL script:
A | |
---|---|
1 | >orders=file("/home/ctx/orders_xh.ctx").open().memory() |
2 | =now() |
3 | =orders.(left(O_ORDERPRIORITY,2)!="9-" && O_ORDERSTATUS!="A" && O_ORDERDATE>date("1990-01-01") && O_TOTALPRICE>price) |
4 | =file("/home/ctx/lineitem_xh.ctx").open().cursor(L_ORDERKEY,L_EXTENDEDPRICE).select(A3(L_ORDERKEY)) |
5 | =A4.total(sum(L_EXTENDEDPRICE)) |
6 | =interval@s(A2,now()) |
The code explanation is the same as that of whole RAM computation.
Here are results of the above test scripts (Unit: second):
Number of records of filtered dimension table | 64.43 million | 49.95 million | 35.90 million | 22.49 million | 4.28 million |
---|---|---|---|---|---|
Association & filtering | 101 | 98 | 97 | 94 | 92 |
Index recreation | 102 | 98 | 92 | 73 | 53 |
Index reuse | 85 | 82 | 77 | 74 | 57 |
Numberized foreign key | 79 | 78 | 76 | 75 | 72 |
Alignment base sequence building | 53 | 49 | 47 | 43 | 39 |
In these tests, the records in fact table (3 billion) is four times as many as those in dimension table (75 million).
The query analysis is the same as that for the previous section of tests, but the proportion of the fact table size and the dimension table size decreases from 10 times to 4 times. In this case the speed difference between the numberized foreign key and the index reuse becomes quite small. The numberized foreign key skill is even faster when the number of records filtered away from the dimension table is small because the numberization-based association is more efficient than hash values comparison.
Based on the previous analysis, we can come to a conclusion about which technique we should use to speed up dimension data filtering and computation, and here are the instructions:
1.When the fact table size is smaller than the dimension table size
-
Perform pre-association if all data can fit into memory.
-
If the data can’t fit into memory but the dimension table key and the fact table’s foreign key are already numberized, first join the fact table and the dimension table by numberized key values and then filter the fact table.
-
If the data can’t fit into memory and the dimension table key and the fact table’s foreign key are not numberized, first join the fact table and the dimension table via foreign key and then filter the fact table.
2.When the fact table size is much larger than the dimension table size
-
If the fact table’s foreign key is already numberized, create an alignment base sequence according to which the fact table is corresponded.
-
If the fact table’s foreign key isn’t numberized, first filter the dimension table and reuse the old index, then perform association by the foreign key.
3.When the fact table size is slightly larger than the dimension table size
-
If the fact table’s foreign key is already numberized, create an alignment base sequence according to which the fact table is corresponded.
-
If the fact table’s foreign key isn’t numberized, we’d better do a test to confirm which skill is faster: pre-association (if the data can fit into memory) or index reuse.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code