Performance optimization case course: TPCH-Q17
select
sum(l_extendedprice) / 7.0 as avg_yearly
from
lineitem,part
where
p_partkey = l_partkey
and p_brand = 'Brand#33'
and p_container = 'LG DRUM'
and l_quantity < (
select
0.2 * avg(l_quantity)
From
Lineitem
Where
l_partkey = p_partkey
);
Such an association with the primary table according to the equivalence condition in the subquery can always be converted to a JOIN operation, and the SQL statement can be rewritten as:
select
sum(l_extendedprice) / 7.0 as avg_yearly
from
lineitem,part,
(select
l_partkey lp,
0.2*avg(l_quantity) lq
from
lineitem
group by
l_partkey
) lpq
where
p_partkey = l_partkey
and p_brand = 'Brand#33'
and p_container = 'LG DRUM'
and l_partkey = lp
and l_quantity < lq;
The subquery lpq is a grouping & aggregation operation, the results of which can be regarded as a table with lp as the primary key. Thus, the main query is equivalent to an aggregation operation on the result set of associating the fact table lineitem with the dimension tables lpq and part.
1. Data storage
There is no special requirement for tables lineitem and part, store them in order by primary key.
Continue to use lineitem.ctx from Q3 and part.ctx from Q2. Copy them to the main directory of this query.
2. General method
Calculation code:
A |
B |
|
1 |
=now() |
|
2 |
=brand="Brand#33" |
=container="LG DRUM" |
3 |
=file("part.ctx").open().cursor@m(P_PARTKEY;P_BRAND==brand && P_CONTAINER==container).fetch().keys@im(P_PARTKEY) |
|
4 |
=file("lineitem.ctx").open() |
|
5 |
=A4.cursor@m(L_PARTKEY,L_QUANTITY;A3.find(L_PARTKEY)) |
|
6 |
=A5.groups@u(L_PARTKEY;avg(L_QUANTITY):avg).run(avg/=5).keys@im(L_PARTKEY) |
|
7 |
=A4.cursor@m(L_PARTKEY,L_QUANTITY,L_EXTENDEDPRICE;L_PARTKEY:A6) |
|
8 |
=A7.total(sum(if(L_QUANTITY<L_PARTKEY.avg,L_EXTENDEDPRICE,0)))/7.0 |
|
9 |
=interval@ms(A1,now()) |
A5 and A7 adopt the pre-cursor filtering technology, with the filter condition being search by primary key.
Test result:
Test items |
Execution time (seconds) |
General method |
10 |
3. Data conversion
For this query, we need to use the dimension table primary key sequence-numberization method. Copy lineitem_4.ctx from Q8 and part_2.ctx from Q16 to the main directory of this query.
In the part_2.ctx of Q16, the enumeration string field p_brand has been converted to numbers. We need to convert the enumeration string field p_container to numbers.
Code for data conversion:
A |
|
1 |
=file("part_16_3.ctx").open().cursor().fetch() |
2 |
=A1.id(P_CONTAINER).sort() |
3 |
=file("p_container.btx").export@b(A2) |
4 |
=A1.run(P_CONTAINER=A2.pos@b(P_CONTAINER)) |
5 |
=file("part_17_3.ctx").create(#P_PARTKEY, P_NAME,P_MFGR, P_BRAND, P_TYPE, P_SIZE, P_CONTAINER, P_RETAILPRICE, P_COMMENT) |
6 |
>A5.append(A4.cursor()) |
Calculation code:
A |
B |
|
1 |
=now() |
|
2 |
=file("p_brand.btx").import@b().(_1) |
|
3 |
=file("p_container.btx").import@b().(_1) |
|
4 |
=brand=A2.pos@b("Brand#33") |
=container=A3.pos@b("LG DRUM") |
5 |
=file("part_17_3.ctx").open() |
=A5.cursor().skip().(false) |
6 |
=A5.cursor@m(P_PARTKEY;P_BRAND==A4 && P_CONTAINER==B4).fetch().(B5(P_PARTKEY)=true) |
|
7 |
=file("lineitem_4.ctx").open() |
|
8 |
=A7.cursor@m(L_PARTKEY,L_QUANTITY;B5(L_PARTKEY)) |
|
9 |
=A8.groups@u(L_PARTKEY;avg(L_QUANTITY):avg).run(B5(L_PARTKEY)=avg*0.2) |
|
10 |
=A7.cursor@m(L_PARTKEY,L_QUANTITY,L_EXTENDEDPRICE;L_PARTKEY:B5:#) |
|
11 |
=A10.total(sum(if(L_QUANTITY<L_PARTKEY,L_EXTENDEDPRICE,0)))/7.0 |
|
12 |
=interval@ms(A1,now()) |
A6 is an alignment sequence, and its value is true or false.
A9 updates the value of the alignment sequence A6 to false or avg*0.2. Thus, in A10, the L_PARTKEY of the record that meets the condition will be assigned a value of avg*0.2, allowing it to participate in subsequent calculations in A11.
Test result:
Test items |
Execution time (seconds) |
General method |
10 |
Data conversion |
9 |
After data conversion, the speed is not improved obviously. The reason is that the filter condition of pre-cursor filtering technology used in previous section is a search calculation by primary key, the performance of which is not much different from that of the alignment sequence used in this section.
4. Column-wise computing
Calculation code:
A |
B |
|
1 |
=now() |
|
2 |
=file("p_brand.btx").import@b().(_1) |
|
3 |
=file("p_container.btx").import@b().(_1) |
|
4 |
=brand=A2.pos@b("Brand#33") |
=container=A3.pos@b("LG DRUM") |
5 |
=file("part_17_3.ctx").open() |
=A5.cursor().skip().(false) |
6 |
=A5.cursor@mv(P_PARTKEY;P_BRAND==A4 && P_CONTAINER==B4).fetch().(B5(P_PARTKEY)=true) |
|
7 |
=file("lineitem_4.ctx").open() |
|
8 |
=A7.cursor@mv(L_PARTKEY,L_QUANTITY;B5(L_PARTKEY)) |
|
9 |
=A8.groups@u(L_PARTKEY;avg(L_QUANTITY):avg).run(B5(L_PARTKEY)=avg*0.2) |
|
10 |
=A7.cursor@m(L_PARTKEY,L_QUANTITY,L_EXTENDEDPRICE;B5(L_PARTKEY)) |
|
11 |
=A10.total(sum(if(L_QUANTITY<B5(L_PARTKEY),L_EXTENDEDPRICE,0)))/7.0 |
|
12 |
=interval@ms(A1,now()) |
Since the columnar cursor cannot use switch to assign values to fields, the filter condition in A10 is modified from L_PARTKEY:B5:# in the previous section as B5(L_PARTKEY), which can only implement filtering but not assignment. A11needstouseB5(L_PARTKEY) to calculate again.
Test result:
Test items |
Execution time (seconds) |
General method |
10 |
Data conversion |
9 |
Column-wise computing |
6 |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version