Performance optimization case course: TPCH-Q20
select
s_name,s_address
from
supplier,nation
where
s_suppkey in (
select
ps_suppkey
from
partsupp
where
ps_partkey in (
select
p_partkey
from
part
where
p_name like 'bisque%'
)
and ps_availqty > (
select
0.5 * sum(l_quantity)
from
Lineitem
where
l_partkey = ps_partkey
and l_suppkey = ps_suppkey
and l_shipdate >= date '1995-01-01'
and l_shipdate < date '1995-01-01' + interval '1' year
)
)
and s_nationkey = n_nationkey
and n_name = 'CHINA'
order by
s_name;
The main query is simple, but the nested filtering condition is complex.
The filtering condition is mainly for the S_SUPPKEY field of the supplier table. Since this table is relatively small, it is easy to get S_NAME and S_ADDRESS as long as the SUPPKEY set meeting the condition is calculated. Calculating the SUPPKEY set in steps can make the problem-solving idea clearer.
The original SQL statement is to filter supplier with partsupp. Now we modify it to filter partsupp with supplier, the obtained PS _ SUPPKEY set is the same. The partsupp is relatively large, so regarding supplier and part as dimension tables to filter the fact table partsupp is more conducive to improving performance. Calculation steps:
1. Filter the nation table according to the condition, and then filter the supplier table with the result.
2. Filter the part table according to the condition.
3. Take the results of steps 1 and 2 as foreign key tables to match and filter partsupp table.
These three steps can be viewed as a view v1, expressed in SQL as follows:
select
ps_suppkey
from
partsupp,
(
select
p_partkey
from
part
where
p_name like 'bisque%'
),
(
select
s_suppkey
from
supplier,nation
where
s_nationkey = n_nationkey
and n_name = 'CHINA'
)
where
ps_partkey=p_partkey
and ps_suppkey=s_suppkey
With v1, the original SQL statement can be modified as:
select
s_name,s_address
from
supplier,
(
select
distinct ps_suppkey
from
v1
where
ps_availqty > (
select
0.5 * sum(l_quantity)
from
lineitem
where
l_partkey = ps_partkey
and l_suppkey = ps_suppkey
and l_shipdate >= date '1995-01-01'
and l_shipdate < date '1995-01-01' + interval '1' year
)
) q_suppkey
where s_suppkey=ps_suppkey;
Observe the subquery q_suppkey, where the calculationsof v1 and lineitem fall under the situation of associating with the primary table according to the equivalence condition. We can use the method explained in previous articles to change the calculations to a JOIN computation between v1 and lineitem. After modification, the SQL statement can be rewritten as:
select
s_name,s_address
from
supplier,
(
select
distinct ps_suppkey
from
v1,
(select
l_partkey,
l_suppkey,
0.5 * sum(l_quantity) quantity
from
lineitem
where
l_shipdate >= date '1995-01-01'
and l_shipdate < date '1995-01-01' + interval '1' year
group by
l_partkey,
l_suppkey
)
where
ps_partkey = l_partkey
and ps_suppkey = l_suppkey
and ps_availqty > quantity
) q_suppkey
where s_suppkey=ps_suppkey;
Next, we need to continue to modify the subquery q_suppkey. Since the lineitem table is very large, it is necessary to first associate it with v1 to filter out records that do not satisfy the condition, then perform grouping and aggregation, and finally filter out the same L_SUPPKEY set. Rewrite the SQL statement as:
select
s_name,s_address
from
supplier,
(
select
distinct l_suppkey
from
(select
l_partkey,
l_suppkey,
ps_availqty,
0.5 * sum(l_quantity) quantity
from
lineitem,v1
where
ps_partkey = l_partkey
and ps_suppkey = l_suppkey
and l_shipdate >= date '1995-01-01'
and l_shipdate < date '1995-01-01' + interval '1' year
group by
l_partkey,
l_suppkey,
ps_availqty
)
where
ps_availqty > quantity
) suppkey
where s_suppkey=l_suppkey;
1. Data storage
There is no special requirement for tables involved in calculation, store them in order by primary key.
Continue to use lineitem.ctx from Q3, part.ctx, supplier.ctx, partsupp.ctx and nation.btx from Q2. Copy them to the main directory of this query.
2. General method
Calculation code:
A |
|
1 |
=now() |
2 |
1995-1-1 |
3 |
=elapse@y(A2,1) |
4 |
>partname="bisque" |
5 |
>nationname="CHINA" |
6 |
=file("nation.btx").import@b().select@1(N_NAME==nationname).N_NATIONKEY |
7 |
=file("part.ctx").open().cursor@m(P_PARTKEY;pos@h(P_NAME,partname)).fetch().keys@im(P_PARTKEY) |
8 |
=file("supplier.ctx").open().cursor@m(S_SUPPKEY,S_NAME,S_ADDRESS;S_NATIONKEY==A6).fetch().keys@im(S_SUPPKEY) |
9 |
=file("partsupp.ctx").open().cursor@m(PS_PARTKEY,PS_SUPPKEY,PS_AVAILQTY;A7.find(PS_PARTKEY),A8.find(PS_SUPPKEY)).fetch().keys@im(PS_PARTKEY,PS_SUPPKEY) |
10 |
=file("lineitem.ctx").open().cursor@m(L_PARTKEY,L_SUPPKEY,L_QUANTITY;L_SHIPDATE>=A2 && L_SHIPDATE<A3) |
11 |
=A10.join@i(L_PARTKEY:L_SUPPKEY,A9:PS_PARTKEY:PS_SUPPKEY,PS_AVAILQTY) |
12 |
=A11.groups@u(L_PARTKEY,L_SUPPKEY;PS_AVAILQTY,sum(L_QUANTITY):quantity) |
13 |
=A12.select(PS_AVAILQTY*2>quantity).id(L_SUPPKEY) |
14 |
=A8.join@im(S_SUPPKEY,A13:~) |
15 |
=A14.new(S_NAME,S_ADDRESS).sort(S_NAME) |
16 |
=interval@ms(A1,now()) |
A9 is equivalent to calculating the view v1 mentioned above.
A8 retrieves S_NAME and S_ADDRESS while retrieving S_SUPPKEY from the supplier table, which avoids reading them again in A14.
A13 is equivalent to calculating the subquery q_suppkey.
A12 uses the redundant grouping key introduced in previous articles, and PS_AVAILQTY is no longer used as a grouping field, which improves performance.
A13 changes sum(L_QUANTITY)*0.5 to PS_AVAILQTY*2. Both L_QUANTITY and PS_AVAILQTY are integers, which changes decimal multiplication into integer multiplication.
A14 uses join@im, where @i option deletes non-matching records and @m option enables using order-based merge to speed up computation when both S_SUPPKEY and A13 are ordered.
Test result:
Test items |
Execution time (seconds) |
General method |
15 |
3. Data conversion
For this query, we need to use two optimization methods mentioned in previous articles: dimension table primary key sequence-numberization and date-integer conversion. Copy part_2.ctx, supplier_2.ctx, partsupp_2.ctx and nation_2.btx from Q2 and lineitem_14_4.ctx from Q14 to the main directory of this query.
Calculation code:
A |
B |
|
1 |
=now() |
|
2 |
1995-1-1 |
|
3 |
=days@o(elapse@y(A2,1)) |
|
4 |
=days@o(A2) |
|
5 |
>partname="bisque" |
|
6 |
>nationname="CHINA" |
|
7 |
=file("nation_2.btx").import@b().(N_NAME==nationname) |
|
8 |
=file("part_2.ctx").open() |
=A8.cursor().skip().(false) |
9 |
=A8.cursor@m(P_PARTKEY;pos@h(P_NAME,partname)).fetch().run(B8(P_PARTKEY)=true) |
|
10 |
=file("supplier_2.ctx").open() |
=A10.cursor().skip().(null) |
11 |
=A10.cursor@m(S_SUPPKEY,S_NAME,S_ADDRESS;A7(S_NATIONKEY)).fetch() |
=A11.run(B10(S_SUPPKEY)=~) |
12 |
=file("partsupp_2.ctx").open().cursor@m(PS_PARTKEY,PS_SUPPKEY,PS_AVAILQTY;B8(PS_PARTKEY) && B10(PS_SUPPKEY)).fetch().keys@im(PS_PARTKEY,PS_SUPPKEY) |
|
13 |
=file("lineitem_14_4.ctx").open().cursor@m(L_PARTKEY,L_SUPPKEY,L_QUANTITY;L_SHIPDATE>=A4 && L_SHIPDATE<A3) |
|
14 |
=A13.join@i(L_PARTKEY:L_SUPPKEY,A12:PS_PARTKEY:PS_SUPPKEY,PS_AVAILQTY) |
|
15 |
=A14.groups@u(L_PARTKEY,L_SUPPKEY;PS_AVAILQTY,sum(L_QUANTITY):quantity) |
|
16 |
=A15.select(PS_AVAILQTY*2>quantity).id(L_SUPPKEY) |
|
17 |
=B10(A16) |
|
18 |
=A17.new(S_NAME,S_ADDRESS).sort(S_NAME) |
|
19 |
=interval@ms(A1,now()) |
A7, B8, B10 and B11 are all aligned sequences. The value of A10 is null or a supplier record, which enables the direct use of the record in A17, without having to read it again.
Test result:
Test items |
Execution time (seconds) |
General method |
15 |
Data conversion |
10 |
4. Column-wise computing
Calculation code:
A |
B |
|
1 |
=now() |
|
2 |
1995-1-1 |
|
3 |
=days@o(elapse@y(A2,1)) |
|
4 |
=days@o(A2) |
|
5 |
>partname="bisque" |
|
6 |
>nationname="CHINA" |
|
7 |
=file("nation_2.btx").import@b().select@1(N_NAME==nationname).N_NATIONKEY |
|
8 |
=file("part_2.ctx").open() |
=A8.cursor@m().skip().(false) |
9 |
=A8.cursor@mv(P_PARTKEY;pos@h(P_NAME,partname)).fetch().run(B8(P_PARTKEY)=true) |
|
10 |
=file("supplier_2.ctx").open() |
=A10.cursor@m().skip().(null) |
11 |
=A10.cursor@mv(S_SUPPKEY,S_NAME,S_ADDRESS;S_NATIONKEY==A7).fetch() |
=A11.run(B10(S_SUPPKEY)=~) |
12 |
=file("partsupp_2.ctx").open().cursor@mv(PS_PARTKEY,PS_SUPPKEY,PS_AVAILQTY;B8(PS_PARTKEY) && B10(PS_SUPPKEY)).fetch().keys@im(PS_PARTKEY,PS_SUPPKEY) |
|
13 |
=file("lineitem_14_4.ctx").open().cursor@mv(L_PARTKEY,L_SUPPKEY,L_QUANTITY;L_SHIPDATE>=A4 && L_SHIPDATE<A3) |
|
14 |
=A13.join@i(L_PARTKEY:L_SUPPKEY,A12:PS_PARTKEY:PS_SUPPKEY,PS_AVAILQTY) |
|
15 |
=A14.groups@u(L_PARTKEY,L_SUPPKEY;PS_AVAILQTY,sum(L_QUANTITY):quantity) |
|
16 |
=A15.select@v(PS_AVAILQTY*2>quantity).id(L_SUPPKEY) |
|
17 |
=B10(A16) |
|
18 |
=A17.new(S_NAME,S_ADDRESS).sort(S_NAME) |
|
19 |
=interval@ms(A1,now()) |
Test result:
Test items |
Execution time (seconds) |
General method |
15 |
Data conversion |
10 |
Column-wise computing |
6 |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version