Performance optimization case course: TPCH-Q11
select
ps_partkey,
sum(ps_supplycost * ps_availqty) as value
from
partsupp,
supplier,
nation
where
ps_suppkey = s_suppkey
and s_nationkey = n_nationkey
and n_name = 'CHINA'
group by
ps_partkey
having
sum(ps_supplycost * ps_availqty) > (
select
sum(ps_supplycost * ps_availqty) * 0.000001
from
partsupp,
supplier,
nation
where
ps_suppkey = s_suppkey
and s_nationkey = n_nationkey
and n_name = 'CHINA'
)
order by
value desc;
If we regard the following sub-query as a view V,
select
ps_partkey,
sum(ps_supplycost * ps_availqty) as value
from
partsupp,
supplier,
nation
where
ps_suppkey = s_suppkey
and s_nationkey = n_nationkey
and n_name = 'CHINA'
group by
ps_partkey
then the original query statement is equivalent to:
select
ps_partkey,
value
from V
where value>0.000001*(select sum(value) from V)
Since the view V here is already agrouped result set, which contains a relatively small number of records, traversing V is much less computation-intensive than traversing partsupp table directly.
1. Data storage
Store partsupp in order by the primary key ps_partkey. We can adopt order-based grouping method to improve the computing performance of grouping and aggregation.
For other tables, there is no special sorting requirement, they are all stored in order by primary key.
Continue to use partsupp.ctx, supplier.ctx and nation.btx from Q2.
Copy these tables to the main directory of this query.
2. General method
A |
|
1 |
=now() |
2 |
>name="CHINA" |
3 |
>percent=0.000001 |
4 |
=file("nation.btx").import@b().select@1(N_NAME== name).N_NATIONKEY |
5 |
=file("supplier.ctx").open().cursor@m(S_SUPPKEY;S_NATIONKEY==A4).fetch().keys@i(S_SUPPKEY) |
6 |
=file("partsupp.ctx").open().cursor@m(PS_PARTKEY,PS_AVAILQTY,PS_SUPPLYCOST;A5.find(PS_SUPPKEY)) |
7 |
=A6.groups@o(PS_PARTKEY;sum(PS_SUPPLYCOST*PS_AVAILQTY):value) |
8 |
=A7.sum(value)*percent |
9 |
=A7.select(value>A8).sort@z(value) |
10 |
=interval@ms(A1,now()) |
Here we use the optimization methods mentioned in previous articles, such as pre-cursor filtering, multi-thread parallel computing.
The partsupp table is ordered by the primary key PS_PATKEY, so we can use the order-based grouping strategy to perform grouping operation on this field in order to increase performance of computing view V. A7 uses groups@o to perform the order-based grouping, which amounts to computing view V. Then A8 and A9 traverse A7 twice to get the final result.
Test result:
Test items |
Execution time (seconds) |
General method |
7 |
3. Data conversion
Utilize the conversion method mentioned in previous article: convert primary key of dimension table to sequence number.
We can directly use partsupp_2.ctx, supplier_2.ctx and nation_2.btx converted in Q2.
Copy these tables to the main directory of this query.
Calculation code:
A |
B |
|
1 |
=now() |
|
2 |
>name="CHINA" |
>percent=0.000001 |
3 |
=file("nation_2.btx").import@b(N_NATIONKEY,N_NAME) |
|
4 |
=A3.(N_NAME== name) |
|
5 |
=file("supplier_2.ctx").open() |
=A5.cursor@m().skip().(false) |
6 |
=file("supplier_2.ctx").open().cursor@m(S_SUPPKEY;A4(S_NATIONKEY)).fetch().(B5(S_SUPPKEY)=true) |
|
7 |
=file("partsupp_2.ctx").open().cursor@m(PS_PARTKEY,PS_AVAILQTY,PS_SUPPLYCOST;B5(PS_SUPPKEY)) |
|
8 |
=A7.groups@o(PS_PARTKEY;sum(PS_SUPPLYCOST*PS_AVAILQTY):value) |
|
9 |
=A8.sum(value)*percent |
|
10 |
=A8.select(value>A9).sort@z(value) |
|
11 |
=interval@ms(A1,now()) |
A4 and B5 are aligned sequence.
Test result:
Test items |
Execution time (seconds) |
General method |
7 |
Data conversion |
4 |
4. Column-wise computing
A |
B |
|
1 |
=now() |
|
2 |
>name="CHINA" |
>percent=0.000001 |
3 |
=file("nation_2.btx").import@b(N_NATIONKEY,N_NAME) |
|
4 |
=A3.(N_NAME== name) |
|
5 |
=file("supplier_2.ctx").open() |
=A5.cursor@m().skip().(false) |
6 |
=file("supplier_2.ctx").open().cursor@mv(S_SUPPKEY;A4(S_NATIONKEY)).fetch().(B5(S_SUPPKEY)=true) |
|
7 |
=file("partsupp_2.ctx").open().cursor@mv(PS_PARTKEY,PS_AVAILQTY,PS_SUPPLYCOST;B5(PS_SUPPKEY)) |
|
8 |
=A7.groups@o(PS_PARTKEY;sum(PS_SUPPLYCOST*PS_AVAILQTY):value) |
|
9 |
=A8.sum(value)*percent |
|
10 |
=A8.select(value>A9).sort@z(value) |
|
11 |
=interval@ms(A1,now()) |
Test result:
Test items |
Execution time (seconds) |
General method |
7 |
Data conversion |
4 |
Column-wise computing |
2 |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version