Performance optimization case course: TPCH-Q16
select
p_brand,p_type,p_size,
count(distinct ps_suppkey) as supplier_cnt
from
partsupp,part
where
p_partkey = ps_partkey
and p_brand <> 'Brand#21'
and p_type not like 'SMALL%'
and p_size in (2, 15, 17, 23, 25, 41, 44, 45)
and ps_suppkey not in (
select
s_suppkey
from
supplier
where
s_comment like '%Customer%Complaints%'
)
group by
p_brand,
p_type,
p_size
order by
supplier_cnt desc,
p_brand,
p_type,
p_size;
This is a grouping & aggregation operation on the filtered result set of multi-table association.
1. Data storage
There is no special requirement for tables partsupp, part and supplier, store them in order by primary key.
Continue to use partsupp.ctx, part.ctx and supplier.ctx from Q2 query.
Copy these tables to the main directory of this query.
2. General method
The query is to match partsupp with the two dimension tables part and supplier, filter and then group, where the grouping operation is based on two dimension table fields.
Here we can use the optimization methods mentioned in previous articles. First, we filter the dimension tables, and then perform foreign key attributization on the primary table. The records that do not matchwiththe foreign key table will be filtered out during the process, and we can directly reference the dimension table fields during the grouping operation.
A |
B |
|
1 |
=now() |
|
2 |
="Brand#21" |
="SMALL" |
3 |
=[2,15,17,23,25,41,44,45] |
|
4 |
=file("part.ctx").open().cursor@m(P_PARTKEY,P_BRAND,P_TYPE,P_SIZE;P_BRAND!=A2 && !pos@h(P_TYPE,B2) && A3.contain@b(P_SIZE)).fetch().keys@im(P_PARTKEY) |
|
5 |
=file("supplier.ctx").open().cursor@m(S_SUPPKEY;!like(S_COMMENT,"*Customer*Complaints*")).fetch().keys@im(S_SUPPKEY) |
|
6 |
=file("partsupp.ctx").open().cursor@m(PS_PARTKEY,PS_SUPPKEY;PS_PARTKEY:A4,PS_SUPPKEY:A5) |
|
7 |
=A6.groups@u(PS_PARTKEY.P_BRAND,PS_PARTKEY.P_TYPE,PS_PARTKEY.P_SIZE;icount(PS_SUPPKEY):supplier_cnt) |
|
8 |
=A7.sort@m(-supplier_cnt,P_BRAND,P_TYPE,P_SIZE) |
|
9 |
=interval@ms(A1,now()) |
A4 and A5 retrieve and filter the dimension tables respectively. A6 creates cursor while performing foreign key join and filtering.
Note that A4 uses contain@b when performing the IN judgement, which means the binary search will be adopted. When there are a lot of members in the set on which the IN judgment works, sorting members and using binary search can reduce the number of comparisons and increase performance.
Test result:
Test items |
Execution time (seconds) |
General method |
18 |
3. Data conversion
For this query, we need to use the dimension table primary key sequence-numberization method. The partsupp_2.ctx and supplier_2.ctx from Q2 query can be used directly. Copy them to the main directory of this query.
In addition, we also need to convert the enumeration string field p_brand to numbers based on part_2.ctx of Q2 query.
Code for data conversion:
A |
|
1 |
=file("part_2.ctx").open().cursor().fetch() |
2 |
=A1.id(P_BRAND).sort() |
3 |
=file("p_brand.btx").export@b(A2) |
4 |
=A1.run(P_PARTKEY=#, P_BRAND=A2.pos@b(P_BRAND)) |
5 |
=file("part_16_3.ctx").create(#P_PARTKEY, P_NAME,P_MFGR, P_BRAND, P_TYPE, P_SIZE, P_CONTAINER, P_RETAILPRICE, P_COMMENT) |
6 |
>A5.append(A4.cursor()) |
Calculation code:
A |
B |
|
1 |
=now() |
|
2 |
Brand#21 |
SMALL |
3 |
[2,15,17,23,25,41,44,45] |
|
4 |
=file("p_brand.btx").import@b().(_1) |
=A4.pos@b(A2) |
5 |
=file("p_type.btx").import@b().( !pos@h(_1,B2)) |
|
6 |
=file("part_16_3.ctx").open() |
=A6.cursor@m().skip().(null) |
7 |
=A6.cursor@m(P_PARTKEY,P_BRAND,P_TYPE,P_SIZE;P_BRAND!=B4 && A3.contain@b(P_SIZE) && A5(P_TYPE)).fetch().(B6(P_PARTKEY)=~) |
|
8 |
=file("supplier_2.ctx").open() |
=A8.cursor@m().skip().(false) |
9 |
=A8.cursor@m(S_SUPPKEY;!like(S_COMMENT,"*Customer*Complaints*")).fetch().(B8(S_SUPPKEY)=true) |
|
10 |
=file("partsupp_2.ctx").open().cursor@m(PS_PARTKEY,PS_SUPPKEY;PS_PARTKEY:B6:#,B8(PS_SUPPKEY)) |
|
11 |
=A4.len()+1 |
|
12 |
=A10.groups@u(PS_PARTKEY.P_TYPE*A11*46+PS_PARTKEY.P_SIZE*A11+PS_PARTKEY.P_BRAND:gk;icount(PS_SUPPKEY):supplier_cnt) |
|
13 |
=A12.new@m(A4(gk%A11):P_BRAND,A5(gk\(A11*46)):P_TYPE,gk%(A11*46)\A11:P_SIZE,supplier_cnt) |
|
14 |
=A13.sort(-supplier_cnt,P_BRAND,P_TYPE,P_SIZE) |
|
15 |
=interval@ms(A1,now()) |
A5, B6 and B8 use the alignment sequence technique explained in previous articles. A12 uses the grouping key technique to convert the three-field based grouping operation to the grouping operation by a single filed gk. A13 then computes the original three grouping fields using gk.
Test result:
Test items |
Execution time (seconds) |
General method |
18 |
Data conversion |
14 |
4. Column-wise computing
Calculation code:
A |
B |
|
1 |
=now() |
|
2 |
Brand#21 |
SMALL |
3 |
[2,15,17,23,25,41,44,45] |
|
4 |
=file("p_brand.btx").import@b().(_1) |
=A4.pos@b(A2) |
5 |
=file("p_type.btx").import@b().( !pos@h(_1,B2)) |
|
6 |
=file("part_16_3.ctx").open() |
=A6.cursor@m().skip().(null) |
7 |
=A6.cursor@mv(P_PARTKEY,P_BRAND,P_TYPE,P_SIZE;P_BRAND!=B4 && A3.contain@b(P_SIZE) && A5(P_TYPE)).fetch().(B6(P_PARTKEY)=~) |
|
8 |
=file("supplier_2.ctx").open() |
=A8.cursor@m().skip().(false) |
9 |
=A8.cursor@mv(S_SUPPKEY;!like(S_COMMENT,"*Customer*Complaints*")).fetch().(B8(S_SUPPKEY)=true) |
|
10 |
=file("partsupp_2.ctx").open().cursor@mv(PS_PARTKEY,PS_SUPPKEY;PS_PARTKEY:B6:#,B8(PS_SUPPKEY)) |
|
11 |
=A4.len()+1 |
|
12 |
=A10.groups@u(PS_PARTKEY.P_TYPE*A11*46+PS_PARTKEY.P_SIZE*A11+PS_PARTKEY.P_BRAND:gk;icount(PS_SUPPKEY):supplier_cnt) |
|
13 |
=A12.new@m(A4(gk%A11):P_BRAND,A5(gk\(A11*46)):P_TYPE,gk%(A11*46)\A11:P_SIZE,supplier_cnt) |
|
14 |
=A13.sort(-supplier_cnt,P_BRAND,P_TYPE,P_SIZE) |
|
15 |
=interval@ms(A1,now()) |
Test result:
Test items |
Execution time (seconds) |
General method |
18 |
Data conversion |
14 |
Column-wise computing |
5 |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version