-
Notifications
You must be signed in to change notification settings - Fork 340
SPL computing performance test series:multi index aggregating
In modern business intelligence, the index analysis page often presents multiple indexes at the same time. These indexes are mostly calculated from the same dataset (such as a wide table). This test is to aggregate (group and aggregate) the desired metric fields by multiple (possibly dozens) different dimensions based on the same filter condition.
Based on the wide table described in SPL computing performance test series: associate tables and wide table, we test the performance in calculating one index, two index and three indexes respectively. SPL provides the multipurpose traversal mechanism, which can accomplish multiple grouping and aggregating operations in one traversal, and the time to calculate N indexes is usually not N times that of calculating one index.
In this test, we only test SPL Enterprise Edition (version 20230528), and select the following two products to make a comparison:
- Clickhouse 23.3.1, which is said to be the fastest OLAP database in the world
- Starrocks 3.0.0, which is claimed to be a faster OLAP database
One physical server with configurations as follows:
- 2 x Intel3014 CPUs, main frequency 1.7G, 12 cores in total
- 64G memory
- SSD (Solid State Drive)
In order to be able to test the computing ability of these products on external storage as well as their sensitivity to memory, we use virtual machines to limit the number of CPUs and the capacity of memory, and design two test environments based on the relatively common cloud VM specifications in the industry:
- VM1: 8 CPUs, 32G memory
- VM2: 4 CPUs, 16G memory
For Starrocks, at least two nodes, BE and FE, need to be installed. The BE that undertakes computing task is installed on one VM, while the FE that undertakes management task is installed on the physical machine to avoid affecting the test results.
For SPL and Clickhouse, we only need to install them on VM.
This operation is the same as the one on wide table in SPL computing performance test series: associate tables and wide table.
select
s_nationname,
sum( l_extendedprice * (1 - l_discount) ) as volume
from widetable
where
s_comment not like '%xxx%yyy%'
and o_totalprice>5
and length(p_type) > 2
and c_nationname is not null
and s_nationname is not null
and c_phone is not null
group by s_nationname
SPL script:
A | |
---|---|
1 | =now() |
2 | =file("widetable.ctx").open().cursor@mv(S_NATIONNAME,L_EXTENDEDPRICE,L_DISCOUNT;O_TOTALPRICE>5 && C_NATIONNAME!=null && C_PHONE!= null && S_NATIONNAME!=null && len(P_TYPE)>2 && !like(S_COMMENT,"*xxx*yyy*")) |
3 | =A2.groups(S_NATIONNAME;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):volume) |
4 | =interval@ms(A1,now()) |
select
s_nationname,
sum( l_extendedprice * (1 - l_discount) ) as volume
from widetable
where
s_comment not like '%xxx%yyy%'
and o_totalprice>5
and length(p_type) > 2
and c_nationname is not null
and s_nationname is not null
and c_phone is not null
group by s_nationname
union all
select
c_nationname,
sum( l_extendedprice * (1 - l_discount) ) as volume
from widetable
where
s_comment not like '%xxx%yyy%'
and o_totalprice>5
and length(p_type) > 2
and c_nationname is not null
and s_nationname is not null
and c_phone is not null
group by c_nationname
For the convenience of returning all results in one go, this code uses the union all on the result set of multiple indexes. In practice, the results will be returned separately, and the use of union all here will not affect the performance test.
SPL script:
A | B | |
---|---|---|
1 | =now() | |
2 | =file("widetable.ctx").open().cursor@mv(S_NATIONNAME,C_NATIONNAME,L_EXTENDEDPRICE,L_DISCOUNT;O_TOTALPRICE>5 && C_NATIONNAME!=null && C_PHONE!= null && S_NATIONNAME!=null && len(P_TYPE)>2 && !like(S_COMMENT,"*xxx*yyy*")).derive@o(L_EXTENDEDPRICE*(1-L_DISCOUNT):volume) | |
3 | cursor A2 | =A3.groups@u(S_NATIONNAME:gid;sum(volume):volume) |
4 | cursor | =A4.groups@u(C_NATIONNAME:gid;sum(volume):volume) |
5 | =A3|A4 | |
6 | =interval@ms(A1,now()) |
select
s_nationname,
sum( l_extendedprice * (1 - l_discount) ) as volume
from widetable
where
s_comment not like '%xxx%yyy%'
and o_totalprice>5
and length(p_type) > 2
and c_nationname is not null
and s_nationname is not null
and c_phone is not null
group by s_nationname
union all
select
c_nationname,
sum( l_extendedprice * (1 - l_discount) ) as volume
from widetable
where
s_comment not like '%xxx%yyy%'
and o_totalprice>5
and length(p_type) > 2
and c_nationname is not null
and s_nationname is not null
and c_phone is not null
group by c_nationname
union all
select
p_type,
sum( l_extendedprice * (1 - l_discount) ) as volume
from widetable
where
s_comment not like '%xxx%yyy%'
and o_totalprice>5
and length(p_type) > 2
and c_nationname is not null
and s_nationname is not null
and c_phone is not null
group by p_type
SPL script:
A | B | |
---|---|---|
1 | =now() | |
2 | =file("widetable.ctx").open().cursor@mv(S_NATIONNAME,C_NATIONNAME,P_TYPE,L_EXTENDEDPRICE,L_DISCOUNT;O_TOTALPRICE>5 && C_NATIONNAME!=null && C_PHONE!= null && S_NATIONNAME!=null && len(P_TYPE)>2 && !like(S_COMMENT,"*xxx*yyy*")).derive@o(L_EXTENDEDPRICE*(1-L_DISCOUNT):volume) | |
3 | cursor A2 | =A3.groups@u(S_NATIONNAME:gid;sum(volume):volume) |
4 | cursor | =A4.groups@u(C_NATIONNAME:gid;sum(volume):volume) |
5 | cursor | =A5.groups@u(P_TYPE:gid;sum(volume):volume) |
6 | =A3|A4|A5 | |
7 | =interval@ms(A1,now()) |
The results of SPL computing performance test series: associate tables and wide table show that SPL has better association performance, so we add one test item: associate in SPL first and then calculate three indexes.
SPL script:
A | B | |
---|---|---|
1 | =now() | |
2 | =file("nation.btx").import@bv(N_NAME).(if(N_NAME,N_NAME,null)) | |
3 | =file("customer.ctx").open().import@mv(C_NATIONKEY,C_PHONE).(if(A2(C_NATIONKEY) && C_PHONE,C_NATIONKEY,null)) | |
4 | =file("supplier.ctx").open().import@mv(S_NATIONKEY,S_COMMENT).(if(A2(S_NATIONKEY) && !like(S_COMMENT,"*xxx*yyy*"),S_NATIONKEY,null)) | |
5 | =file("part.ctx").open().import@mv(P_TYPE).(if(len(P_TYPE)>2,P_TYPE,null)) | |
6 | =file("orders.ctx").open().cursor@mv(O_ORDERKEY,O_CUSTKEY;A3(O_CUSTKEY) && O_TOTALPRICE>5) | |
7 | =file("lineitem.ctx").open().news(A6,L_SUPPKEY,L_PARTKEY,O_CUSTKEY,L_EXTENDEDPRICE,L_DISCOUNT;A5(L_PARTKEY) && A4(L_SUPPKEY)).derive@o(L_EXTENDEDPRICE*(1-L_DISCOUNT):volume) | |
8 | cursor A7 | =A8.groups@u(A2(A4(L_SUPPKEY)):gid;sum(volume):volume) |
9 | cursor | =A9.groups@u(A2(A3(O_CUSTKEY)):gid;sum(volume):volume) |
10 | cursor | =A10.groups@u(A5(L_PARTKEY):gid;sum(volume):volume) |
11 | =A8|A9|A10 | |
12 | =interval@ms(A1,now()) |
Unit: seconds
VM1 | VM2 | |||||
---|---|---|---|---|---|---|
Index number | 1 | 2 | 3 | 1 | 2 | 3 |
SPL | 57.7 | 61.6 | 64.6 | 114.2 | 119.5 | 124.1 |
Starrocks | 62.2 | 104.6 | 156.2 | 135.7 | 253.6 | 402.6 |
Clickhouse | 34.7 | 69.0 | 106.4 | 77.4 | 156.0 | 249.6 |
Associate in SPL | 49.5 | 100.5 |
1. The effect of SPL’s multipurpose traversal mechanism is obvious. When the number of indexes increases, the calculation time increases only a little, and the dataset is not traversed many times.
2. Neither of the two SQL products has the ability to optimize. The time to calculate N indexes is close to N times the time to calculate one index, indicating that the dataset is likely to have been traversed multiple times.
3. When calculating one index, the traversing performance of SPL on wide table is not as good as that of Clickhouse. However, after the multipurpose traversal technology is utilized, SPL is better than Clickhouse when calculating two indexes, and much better when involving three indexes. In practice, it often involves calculating more than a dozen or even dozens of indexes simultaneously, and the performance gap will be huge.
4. The test result for the sensitivity to memory is basically the same with that in SPL computing performance test series: associate tables and wide table, that is, SPL is less sensitive to memory, while the two SQL-based products are more sensitive. The extent in performance decrease of the two products surpasses that in number decrease of CPU cores, indicating that they are more easily affected by memory capacity.
5. Conclusions
SPL, which provides multipurpose traversal optimization mechanism, is more suitable for multi-index aggregating scenarios than SQL-based products, and has huge advantages when involving more indexes.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code