SPL computing performance test series: multi-index aggregating

 

I. Test task

In modern business intelligence, the index analysis page often presents multiple indexes at the same time. These indexes are mostly calculated from the same dataset (such as a wide table). This test is to aggregate (group and aggregate) the desired metric fields by multiple (possibly dozens) different dimensions based on the same filter condition.

Based on the wide table described in SPL computing performance test series: associate tables and wide table, we test the performance in calculating one index, two index and three indexes respectively. SPL provides the multipurpose traversal mechanism, which can accomplish multiple grouping and aggregating operations in one traversal, and the time to calculate N indexes is usually not N times that of calculating one index.

II. Technologies to be compared

In this test, we only test SPL Enterprise Edition (version 20230528), and select the following two products to make a comparison:

  1. Clickhouse 23.3.1, which is said to be the fastest OLAP database in the world
  2. Starrocks 3.0.0, which is claimed to be a faster OLAP database

III. Test environment

One physical server with configurations as follows:

2 x Intel3014 CPUs, main frequency 1.7G, 12 cores in total
64G memory
SSD (Solid State Drive)

In order to be able to test the computing ability of these products on external storage as well as their sensitivity to memory, we use virtual machines to limit the number of CPUs and the capacity of memory, and design two test environments based on the relatively common cloud VM specifications in the industry:

VM1: 8 CPUs, 32G memory
VM2: 4 CPUs, 16G memory

For Starrocks, at least two nodes, BE and FE, need to be installed. The BE that undertakes computing task is installed on one VM, while the FE that undertakes management task is installed on the physical machine to avoid affecting the test results.

For SPL and Clickhouse, we only need to install them on VM.

IV. Test process

1. One index

This operation is the same as the one on wide table in SPL computing performance test series: associate tables and wide table .

select
    s_nationname,
    sum( l_extendedprice * (1 - l_discount) ) as volume
from widetable
where
    s_comment not like '%xxx%yyy%'
    and o_totalprice>5
    and length(p_type) > 2
    and c_nationname is not null
    and s_nationname is not null
    and c_phone is not null
group by s_nationname

SPL script:

A
1 =now()
2 =file(“widetable.ctx”).open().cursor@mv(S_NATIONNAME,L_EXTENDEDPRICE,L_DISCOUNT;O_TOTALPRICE>5 && C_NATIONNAME!=null && C_PHONE!= null && S_NATIONNAME!=null && len(P_TYPE)>2 && !like(S_COMMENT,“*xxx*yyy*”))
3 =A2.groups(S_NATIONNAME;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):volume)
4 =interval@ms(A1,now())

2. Two indexes

select
    s_nationname,
    sum( l_extendedprice * (1 - l_discount) ) as volume
from widetable
where
    s_comment not like '%xxx%yyy%'
    and o_totalprice>5
    and length(p_type) > 2
    and c_nationname is not null
    and s_nationname is not null
    and c_phone is not null
group by s_nationname
union all
select
    c_nationname,
    sum( l_extendedprice * (1 - l_discount) ) as volume
from widetable
where
    s_comment not like '%xxx%yyy%'
    and o_totalprice>5
    and length(p_type) > 2
    and c_nationname is not null
    and s_nationname is not null
    and c_phone is not null
group by c_nationname

For the convenience of returning all results in one go, this code uses the union all on the result set of multiple indexes. In practice, the results will be returned separately, and the use of union all here will not affect the performance test.

SPL script:

A B
1 =now()
2 =file(“widetable.ctx”).open().cursor@mv(S_NATIONNAME,C_NATIONNAME,L_EXTENDEDPRICE,L_DISCOUNT;O_TOTALPRICE>5 && C_NATIONNAME!=null && C_PHONE!= null && S_NATIONNAME!=null && len(P_TYPE)>2 && !like(S_COMMENT,“*xxx*yyy*”)).derive@o(L_EXTENDEDPRICE*(1-L_DISCOUNT):volume)
3 cursor A2 =A3.groups@u(S_NATIONNAME:gid;sum(volume):volume)
4 cursor =A4.groups@u(C_NATIONNAME:gid;sum(volume):volume)
5 =A3|A4
6 =interval@ms(A1,now())

3. Three indexes

select
    s_nationname,
    sum( l_extendedprice * (1 - l_discount) ) as volume
from widetable
where
    s_comment not like '%xxx%yyy%'
    and o_totalprice>5
    and length(p_type) > 2
    and c_nationname is not null
    and s_nationname is not null
    and c_phone is not null
group by s_nationname
union all
select
    c_nationname,
    sum( l_extendedprice * (1 - l_discount) ) as volume
from widetable
where
    s_comment not like '%xxx%yyy%'
    and o_totalprice>5
    and length(p_type) > 2
    and c_nationname is not null
    and s_nationname is not null
    and c_phone is not null
group by c_nationname
union all
select
    p_type,
    sum( l_extendedprice * (1 - l_discount) ) as volume
from widetable
where
    s_comment not like '%xxx%yyy%'
    and o_totalprice>5
    and length(p_type) > 2
    and c_nationname is not null
    and s_nationname is not null
    and c_phone is not null
group by p_type

SPL script:

A B
1 =now()
2 =file(“widetable.ctx”).open().cursor@mv(S_NATIONNAME,C_NATIONNAME,P_TYPE,L_EXTENDEDPRICE,L_DISCOUNT;O_TOTALPRICE>5 && C_NATIONNAME!=null && C_PHONE!= null && S_NATIONNAME!=null && len(P_TYPE)>2 && !like(S_COMMENT,“*xxx*yyy*”)).derive@o(L_EXTENDEDPRICE*(1-L_DISCOUNT):volume)
3 cursor A2 =A3.groups@u(S_NATIONNAME:gid;sum(volume):volume)
4 cursor =A4.groups@u(C_NATIONNAME:gid;sum(volume):volume)
5 cursor =A5.groups@u(P_TYPE:gid;sum(volume):volume)
6 =A3|A4|A5
7 =interval@ms(A1,now())

4. Three indexes after associating in SPL

The results of SPL computing performance test series: associate tables and wide table show that SPL has better association performance, so we add one test item: associate in SPL first and then calculate three indexes.

SPL script:

A B
1 =now()
2 =file(“nation.btx”).import@bv(N_NAME).(if(N_NAME,N_NAME,null))
3 =file(“customer.ctx”).open().import@mv(C_NATIONKEY,C_PHONE).(if(A2(C_NATIONKEY) && C_PHONE,C_NATIONKEY,null))
4 =file(“supplier.ctx”).open().import@mv(S_NATIONKEY,S_COMMENT).(if(A2(S_NATIONKEY) && !like(S_COMMENT,“*xxx*yyy*”),S_NATIONKEY,null))
5 =file(“part.ctx”).open().import@mv(P_TYPE).(if(len(P_TYPE)>2,P_TYPE,null))
6 =file(“orders.ctx”).open().cursor@mv(O_ORDERKEY,O_CUSTKEY;A3(O_CUSTKEY) && O_TOTALPRICE>5)
7 =file(“lineitem.ctx”).open().news(A6,L_SUPPKEY,L_PARTKEY,O_CUSTKEY,L_EXTENDEDPRICE,L_DISCOUNT;A5(L_PARTKEY) && A4(L_SUPPKEY)).derive@o(L_EXTENDEDPRICE*(1-L_DISCOUNT):volume)
8 cursor A7 =A8.groups@u(A2(A4(L_SUPPKEY)):gid;sum(volume):volume)
9 cursor =A9.groups@u(A2(A3(O_CUSTKEY)):gid;sum(volume):volume)
10 cursor =A10.groups@u(A5(L_PARTKEY):gid;sum(volume):volume)
11 =A8|A9|A10
12 =interval@ms(A1,now())

V. Test results

Unit: seconds

VM1 VM2
Index number 1 2 3 1 2 3
SPL 57.7 61.6 64.6 114.2 119.5 124.1
Starrocks 62.2 104.6 156.2 135.7 253.6 402.6
Clickhouse 34.7 69.0 106.4 77.4 156.0 249.6
Associate in SPL 49.5 100.5

VI. Comments on test results

1. The effect of SPL’s multipurpose traversal mechanism is obvious. When the number of indexes increases, the calculation time increases only a little, and the dataset is not traversed many times.

2. Neither of the two SQL products has the ability to optimize. The time to calculate N indexes is close to N times the time to calculate one index, indicating that the dataset is likely to have been traversed multiple times.

3. When calculating one index, the traversing performance of SPL on wide table is not as good as that of Clickhouse. However, after the multipurpose traversal technology is utilized, SPL is better than Clickhouse when calculating two indexes, and much better when involving three indexes. In practice, it often involves calculating more than a dozen or even dozens of indexes simultaneously, and the performance gap will be huge.

4. The test result for the sensitivity to memory is basically the same with that in SPL computing performance test series: associate tables and wide table, that is, SPL is less sensitive to memory, while the two SQL-based products are more sensitive. The extent in performance decrease of the two products surpasses that in number decrease of CPU cores, indicating that they are more easily affected by memory capacity.

5. Conclusions

SPL, which provides multipurpose traversal optimization mechanism, is more suitable for multi-index aggregating scenarios than SQL-based products, and has huge advantages when involving more indexes.