Performance Optimization Exercises Using TPC-H – Q21
Ⅰ SQL code and analysis
Below is the SQL query statement:
select * from (
select
s_name,
count(*) as numwait
from
supplier,lineitem l1,orders,nation
where
s_suppkey = l1.l_suppkey
and o_orderkey = l1.l_orderkey
and o_orderstatus = 'F'
and l1.l_receiptdate > l1.l_commitdate
and exists (
select
*
from
lineitem l2
where
l2.l_orderkey = l1.l_orderkey
and l2.l_suppkey <> l1.l_suppkey
)
and not exists (
select
*
from
lineitem l3
where
l3.l_orderkey = l1.l_orderkey
and l3.l_suppkey <> l1.l_suppkey
and l3.l_receiptdate > l3.l_commitdate
)
and s_nationkey = n_nationkey
and n_name = 'CHINA'
group by
s_name
order by
numwait desc,
s_name
) where rownum<=100;
The main query is a simple grouping & aggregation operation on the result set of primary-sub table association, with a complex filtering condition containing two exists clauses.
Ⅱ SPL solution
Both exists clauses perform computation on lineitem records under same l_orderkey.
We know that lineitem table is already ordered by l_orderkey. Besides, records of the sub table can be regarded as a set type field of the primary table. If we perform order-based grouping (but without aggregation) on result set of the association between orders table and lineitem table on orderkey, we get a series of small sets of lineitem records having same l_orderkey value. Then it becomes simple to compute the two exists conditions on each small set.
A |
|
1 |
=now() |
2 |
>name="CHINA" |
3 |
=file("nation.btx").import@b().select@1(N_NAME==name).N_NATIONKEY |
4 |
=file("supplier.ctx").open().cursor@m(S_SUPPKEY,S_NAME;S_NATIONKEY==A3).fetch().keys@im(S_SUPPKEY) |
5 |
=file("orders.ctx").open().cursor@m(O_ORDERKEY;O_ORDERSTATUS=="F") |
6 |
=file("lineitem.ctx").open().news@r(A5,L_ORDERKEY:ORDERKEY;A4.find(L_SUPPKEY)) |
7 |
=file("lineitem.ctx").open().news(A6:ORDERKEY,L_ORDERKEY,L_SUPPKEY,L_RECEIPTDATE,L_COMMITDATE) |
8 |
=A7.group(L_ORDERKEY) |
9 |
=A8.conj(if((t=~.select(L_RECEIPTDATE>L_COMMITDATE)) && (s=t.#2) && !t.pselect(#2!=s) && ~.pselect(#2!=s),t,null) ) |
10 |
=A9.switch@i(L_SUPPKEY,A4) |
11 |
=A10.groups@u(L_SUPPKEY.S_NAME:s_name;count(1):numwait) |
12 |
=A11.top(100;-numwait,s_name) |
13 |
=interval@ms(A1,now()) |
A6 association orders table and lineitem table, filters association result and selects eligible ORDERKEY values. A7 associates with A6 to select the desired fields. A8 performs order-based grouping on A7 to divide it into sets of lineitem records having same l_orderkey values. A9 judges each of these small sets according to exists conditions and filters away the ineligible records. The rest of the code associates with the other foreign key table and performs the regular grouping operation on A10.
Ⅲ Further optimization
1. Optimization method
In this example, we will use date-integer conversion optimization method explained in Q1 – lineitem table’s L_COMMITDATE field and L_RECEIPTDATE field have been converted in the previous examples, string-integer conversion method explained in the same essay – here orders table’s O_ORDERSTATUS field needs to be converted, and dimension table primary key numberization method explained in Q2 – supplier table’s S_SUPPKEY field and lineitem table’s L_SUPPKEY field are already converted in the previous examples.
2. Code for data conversion
2.1 Conversion on nation table, supplier table and lineitem table
Copy nation_20.btx, supplier_20.ctx and lineitem_20.ctx, and rename them nation_21.btx, supplier_21.ctx and lineitem_21.ctx respectively.
2.2 Conversion on orders table
A |
|
1 |
=file("orders.ctx").open().cursor(O_ORDERSTATUS) |
2 |
=A1.id(O_ORDERSTATUS).sort() |
3 |
=file("o_orderstatus.txt").export(A2) |
4 |
=file("orders_18.ctx").open().cursor() |
5 |
=A4.run(O_ORDERSTATUS=A2.pos@b(O_ORDERSTATUS)) |
6 |
=file("orders_21.ctx").create(#O_ORDERKEY,O_CUSTKEY,O_ORDERSTATUS,O_TOTALPRICE,O_ORDERDATE,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_COMMENT) |
7 |
>A6.append(A5) |
3. Code after data conversion
The dimension tables and the value list of string fields need to be preloaded. Below is preloading code:
A |
|
1 |
>env(o_orderstatus,file("o_orderstatus.txt").import@si()) |
2 |
>env(nation, file("nation_21.btx").import@b()) |
3 |
>env(supplier, file("supplier_21.ctx").open().import()) |
Before performing the query, we need to first run the preloading code to load small dimension tables into memory.
Computing code:
A |
|
1 |
=now() |
2 |
>name="CHINA" |
3 |
=o_orderstatus.pos@b("F") |
4 |
=nation.select@1(N_NAME==name).N_NATIONKEY |
5 |
=supplier.@m(if(S_NATIONKEY==A4,S_NAME,null)) |
6 |
=file("orders_21.ctx").open().cursor@m(O_ORDERKEY;O_ORDERSTATUS==A3) |
7 |
=file("lineitem_21.ctx").open().news@r(A6,L_ORDERKEY:ORDERKEY;A5(L_SUPPKEY)) |
8 |
=file("lineitem_21.ctx").open().news(A7:ORDERKEY,L_ORDERKEY,L_SUPPKEY,L_COMMITDATE,L_RECEIPTDATE) |
9 |
=A8.group(L_ORDERKEY) |
10 |
=A9.conj(if((t=~.select(L_RECEIPTDATE>L_COMMITDATE)) && (s=t.#2) && !t.pselect(#2!=s) && ~.pselect(#2!=s),t.select(A5(L_SUPPKEY)),null) ) |
11 |
=A10.groups@u(L_SUPPKEY:s_name;count(1):numwait).run@m(s_name=A5(s_name)) |
12 |
=A11.top(100;-numwait,s_name) |
13 |
=interval@ms(A1,now()) |
Ⅳ Using enterprise edition’s column-wise computation
1. Original data
A |
|
1 |
=now() |
2 |
>name="CHINA" |
3 |
=file("nation.btx").import@b().select@1(N_NAME==name).N_NATIONKEY |
4 |
=file("supplier.ctx").open().cursor@mv(S_SUPPKEY,S_NAME;S_NATIONKEY==A3).fetch().keys@im(S_SUPPKEY) |
5 |
=file("orders.ctx").open().cursor@mv(O_ORDERKEY;O_ORDERSTATUS=="F") |
6 |
=file("lineitem.ctx").open().news@r(A5,L_ORDERKEY:ORDERKEY,min(L_SUPPKEY):SK,count(1):cnt; A4.find(L_SUPPKEY) && L_RECEIPTDATE>L_COMMITDATE) |
7 |
=file("lineitem.ctx").open().news@r(A6:ORDERKEY,L_ORDERKEY,SK,cnt,count(L_RECEIPTDATE>L_COMMITDATE && SK!=L_SUPPKEY):c1,count(SK!=L_SUPPKEY):c2).select@v(c1==0 && c2!=0) |
8 |
=A7.groups@u(SK;sum(cnt):numwait).join(SK,A4,S_NAME).new(S_NAME:s_name,numwait) |
9 |
=A8.top(100;-numwait,s_name) |
10 |
=interval@ms(A1,now()) |
2. Optimized data
The dimension tables and the value list of string fields need to be preloaded. Below is preloading code:
A |
|
1 |
>env(o_orderstatus,file("o_orderstatus.txt").import@si()) |
2 |
>env(nation, file("nation_21.btx").import@bv()) |
3 |
>env(supplier, file("supplier_21.ctx").open().import@v()) |
Before performing the query, we need to first run the preloading code to load small dimension tables into memory.
Computing code:
A |
|
1 |
=now() |
2 |
>name="CHINA" |
3 |
=o_orderstatus.pos@b("F") |
4 |
=nation.select@1(N_NAME==name).N_NATIONKEY |
5 |
=supplier.(if(S_NATIONKEY==A4,S_NAME,null)) |
6 |
=file("orders_21.ctx").open().cursor@mv(O_ORDERKEY;O_ORDERSTATUS==A3) |
7 |
=file("lineitem_21.ctx").open().news@r(A6,L_ORDERKEY:ORDERKEY,min(L_SUPPKEY):SK,count(1):cnt; A5(L_SUPPKEY) && L_RECEIPTDATE>L_COMMITDATE) |
8 |
=file("lineitem_21.ctx").open().news@r(A7:ORDERKEY,L_ORDERKEY,SK,cnt,count(L_RECEIPTDATE>L_COMMITDATE && SK!=L_SUPPKEY):c1,count(SK!=L_SUPPKEY):c2).select@v(c1==0 && c2!=0) |
9 |
=A8.groups@u(SK;sum(cnt):numwait).new(A5(SK):s_name,numwait) |
10 |
=A9.top(100;-numwait,s_name) |
11 |
=interval@ms(A1,now()) |
Ⅴ Test result
Unit: Second
Regular |
Column-wise |
|
Before optimization |
19.9 |
14.0 |
After optimization |
14.3 |
9.4 |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL