Performance optimization case course: TPCH-Q14
select
100.00 * sum(
case when p_type like 'PROMO%' then l_extendedprice * (1 - l_discount)
else 0 end)
/ sum(l_extendedprice * (1 - l_discount) ) as promo_revenue
from
lineitem,
part
where
l_partkey = p_partkey
and l_shipdate >= date '1995-04-01'
and l_shipdate < date '1995-04-01' + interval '1' month;
This is an aggregate operation on the filtered result set of two-table association.
1. Data storage
There is no special requirement for tables lineitem and part, they are stored in order by primary key.
We can continue to use lineitem.ctx and part.ctx from Q3.
Copy these tables to the main directory of this query.
2. General method
This is a regular association-based sum calculation. We can handle it by making full use of the parallel processing.
A |
|
1 |
=now() |
2 |
1995-4-1 |
3 |
=elapse@m(A2,1) |
4 |
=file("part.ctx").open().cursor@m(P_PARTKEY,P_TYPE).fetch().keys@i(P_PARTKEY) |
5 |
=file("lineitem.ctx").open().cursor@m(L_PARTKEY,L_EXTENDEDPRICE,L_DISCOUNT;L_SHIPDATE>=A2 &&L_SHIPDATE<A3,L_PARTKEY:A4) |
6 |
=A5.run(L_EXTENDEDPRICE*=(1-L_DISCOUNT),L_DISCOUNT=if(pos@h(L_PARTKEY.P_TYPE,"PROMO"),L_EXTENDEDPRICE,0)) |
7 |
=A6.total(sum(L_DISCOUNT),sum(L_EXTENDEDPRICE)) |
8 |
=100.00*A7(1)/A7(2) |
9 |
=interval@ms(A1,now()) |
The code utilizes optimization methods mentioned in previous articles, such as dimension table attributization, multi-thread parallel processing.
Test result:
Test items |
Execution time (seconds) |
General method |
103 |
Since the part table contains many records, and p_type is a long string field, it consumes a large amount of memory space when loading them all into memory. After loading the part into memory and joining it with lineitem, the number of string comparisons is roughly the same as the number of records in lineitem. Because the number of records in lineitem is very large, the performance is poor.
3. Calculate dimension table in advance
The string comparison in the previous section is performed after joining two tables. In fact, this is a calculation on dimension table field, which can be done in advance. Coding in SQL is roughly like:
select
100.00 * sum(
case when p_type then l_extendedprice * (1 - l_discount)
else 0 end)
/ sum(l_extendedprice * (1 - l_discount) ) as promo_revenue
from
lineitem,
(select
p_partkey,
(case when p_type like 'PROMO%' then true else false end) as p_type
from
part
) p
where
l_partkey = p_partkey
and l_shipdate >= date '1995-04-01'
and l_shipdate < date '1995-04-01' + interval '1' month;
Calculation code:
A |
|
1 |
=now() |
2 |
1995-4-1 |
3 |
=elapse@m(A2,1) |
4 |
=file("part.ctx").open().cursor@m(P_PARTKEY,P_TYPE).run(P_TYPE=pos@h(P_TYPE,"PROMO")).fetch().keys@i(P_PARTKEY) |
5 |
=file("lineitem.ctx").open().cursor@m(L_PARTKEY,L_EXTENDEDPRICE,L_DISCOUNT;L_SHIPDATE>=A2 &&L_SHIPDATE<A3,L_PARTKEY:A4) |
6 |
=A5.run(L_EXTENDEDPRICE*=(1-L_DISCOUNT),L_DISCOUNT=if(L_PARTKEY.P_TYPE,L_EXTENDEDPRICE,0)) |
7 |
=A6.total(sum(L_DISCOUNT),sum(L_EXTENDEDPRICE)) |
8 |
=100.00*A7(1)/A7(2) |
9 |
=interval@ms(A1,now()) |
A4 performs string comparison calculation, and the number of calculations is the same as the number of records in part, but much smaller than the number of records in lineitem. What is finally read into memory is 1 or null, occupyingmuchlessmemory than string.
There is no need to do string comparison calculation in A5.
Test result:
Test items |
Execution time (seconds) |
General method |
103 |
Calculate dimension table in advance |
8 |
4. Data conversion
Utilize the methods mentioned in previous articles: convert date to integer, convert primary key of dimension table to sequence number.
We can directly use part_2.ctx and p_type.btx from Q2.
Copy these tables to the main directory of this query.
The lineitem_4.ctx converted in Q8 has implemented sequence-numberization of l_partkey, but did not implement integerization of the date l_shipdate, so it needs to be converted here.
Code for data conversion:
A |
|
1 |
=file("lineitem_4.ctx").open().cursor().run(L_SHIPDATE=days@o(L_SHIPDATE)) |
2 |
=file("lineitem_14_4.ctx").create(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS,L_SHIPDATE,L_COMMITDATE,L_RECEIPTDATE,L_SHIPINSTRUCT,L_SHIPMODE,L_COMMENT) |
3 |
>A2.append(A1) |
Calculation code:
A |
B |
|
1 |
=now() |
|
2 |
1995-4-1 |
PROMO |
3 |
=days@o(elapse@m(A2,1)) |
=days@o(A2) |
4 |
=p_type=file("p_type.btx").import@b().(pos@h(_1,B2)) |
|
5 |
=file("part_2.ctx").open() |
=A5.cursor@m().skip().(false) |
6 |
=A5.cursor@m(P_PARTKEY;A4(P_TYPE)).fetch().(B5(P_PARTKEY)=true) |
|
7 |
=file("lineitem_14_4.ctx").open().cursor@m(L_PARTKEY,L_EXTENDEDPRICE,L_DISCOUNT;L_SHIPDATE>=B3 && L_SHIPDATE<A3) |
|
8 |
=A7.run(L_EXTENDEDPRICE*=(1-L_DISCOUNT),L_DISCOUNT=if(B5(L_PARTKEY),L_EXTENDEDPRICE,0)) |
|
9 |
=A8.total(sum(L_DISCOUNT),sum(L_EXTENDEDPRICE)) |
|
10 |
=100.00*A9(1)/A9(2) |
A4 and B5 are alignment sequence.
The p_type field of the part table has been converted to integer. A6 will no longer retrieve the string value of the original p_type field, but the p_partkey integer value that meet the condition of the new p_type field. As a result, both the reading of external storage and the memory usage are greatly decreased, and performance is significantly improved.
Test result:
Test items |
Execution time (seconds) |
General method |
103 |
Calculate dimension table in advance |
8 |
Data conversion |
6 |
5. Column-wise computing
A |
B |
|
1 |
=now() |
|
2 |
1995-4-1 |
PROMO |
3 |
=days@o(elapse@m(A2,1)) |
=days@o(A2) |
4 |
=p_type=file("p_type.btx").import@b().(pos@h(_1,B2)) |
|
5 |
=file("part_2.ctx").open() |
=A5.cursor@m().skip().(false) |
6 |
=A5.cursor@m(P_PARTKEY;A4(P_TYPE)).fetch().(B5(P_PARTKEY)=true) |
|
7 |
=file("lineitem_3.ctx").open().cursor@mv(L_PARTKEY,L_EXTENDEDPRICE,L_DISCOUNT;L_SHIPDATE>=B3 && L_SHIPDATE<A3) |
|
8 |
=A7.derive@o(L_EXTENDEDPRICE*(1-L_DISCOUNT):dp,if(B5(L_PARTKEY),dp,0):dp1) |
|
9 |
=A8.total(sum(dp1),sum(dp)) |
|
10 |
=100.00*A9(1)/A9(2) |
Column-wise computing cannot assign value to field, so A8 changes the assignment statement in the previous section to new fields dp and dp1.
Test result:
Test items |
Execution time (seconds) |
General method |
103 |
Calculate dimension table in advance |
8 |
Data conversion |
6 |
Column-wise computing |
3 |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version