Performance optimization case course: TPCH-Q15
create view revenue (supplier_no, total_revenue) as
select
l_suppkey,
sum(l_extendedprice * (1 - l_discount))
from
lineitem
where
l_shipdate >= date '1995-04-01'
and l_shipdate < date '1995-04-01' + interval '3' month
group by
l_suppkey;
select
s_suppkey,
s_name,
s_address,
s_phone,
total_revenue
from
supplier,
revenue
where
s_suppkey = supplier_no
and total_revenue = (
select
max(total_revenue)
from
revenue
)
order by
s_suppkey;
drop view revenue;
This code is to generate a view revenue first, and then associate it with supplier, and finally select the record containing the maximum value.
1. Data storage
There is no special requirement for tables lineitem and supplier, they are stored in order by primary key.
We can continue to use lineitem.ctx from Q3 and supplier.ctx from Q2.
Copy these tables to the main directory of this query.
2. General method
We divide this query into two phases. The first phase generates a view revenue, and the second phase gets the record where total_revenue is the biggest from revenue. The former is a regular grouping & aggregation operation, where parallel processing is used to improve performance.
SPL offers A.maxp method that can directly return the record containing the maximum value. Since the number of records corresponding to the maximum value will not be too large, the big dimension table search method mentioned in Q10 can be adopted.
A |
|
1 |
=now() |
2 |
1995-4-1 |
3 |
=elapse@m(A2,3) |
4 |
=file("lineitem.ctx").open().cursor@m(L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT;L_SHIPDATE>=A2 && L_SHIPDATE<A3) |
5 |
=A4.groups@u(L_SUPPKEY:supplier_no;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):total_revenue) |
6 |
=A5.maxp@a(total_revenue) |
7 |
=file("supplier.ctx").open() |
8 |
=A6.joinx@q(supplier_no,A7:S_SUPPKEY,S_NAME,S_ADDRESS,S_PHONE) |
9 |
=interval@ms(A1,now()) |
The code utilizes optimization methods mentioned in previous articles, such as dimension table attributization, multi-thread parallel processing.
A8 uses the big dimension table search method to implement the join between A6 and big dimension table supplier.
Test result:
Test items |
Execution time (seconds) |
General method |
6 |
3. Data conversion
There is a need to convert the L_SHIPDATE in lineitem table to integer as described in previous articles, and convert the primary key of lineitem’s dimension table supplier to sequence number so as to perform sequence-number-based grouping. The sequence-numberization of dimension table primary key needs to be performed based on the lineitem.ctx in Q3 and the supplier.ctx in Q2.
In addition, it also needs to copy the supplier_2.ctx that has performed sequence-numberization in Q2 to the main directory of this query.
Code for data conversion:
A |
|
1 |
=file("supplier.ctx").open().import(S_SUPPKEY).keys@im(S_SUPPKEY) |
2 |
=file("lineitem.ctx").open().cursor() |
3 |
=A2.run(L_SUPPKEY=A1.pfind(L_SUPPKEY),L_SHIPDATE=days@o(L_SHIPDATE)) |
4 |
=file("lineitem_15_3.ctx").create@p(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE,L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT) |
5 |
>A4.append(A3) |
Calculation code:
A |
|
1 |
=now() |
2 |
1995-4-1 |
3 |
=days@o(elapse@m(A2,3)) |
4 |
=days@o(A2) |
5 |
=file("lineitem_15_3.ctx").open().cursor@m(L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT;L_SHIPDATE>=A4 && L_SHIPDATE<A3) |
6 |
=A5.groups@n(L_SUPPKEY:supplier_no;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):total_revenue) |
7 |
=A6.maxp@a(total_revenue) |
8 |
=file("supplier_2.ctx").open() |
9 |
=A7.joinx@q(supplier_no,A8:S_SUPPKEY,S_NAME,S_ADDRESS,S_PHONE) |
10 |
=interval@ms(A1,now()) |
The groups@n in A6 refers to the sequence-number-based grouping mentioned in the previous article.
Test result:
Test items |
Execution time (seconds) |
General method |
6 |
Data conversion |
5 |
4. Column-wise computing
Calculation code:
A |
|
1 |
=now() |
2 |
1995-4-1 |
3 |
=days@o(elapse@m(A2,3)) |
4 |
=days@o(A2) |
5 |
=file("lineitem_15_3.ctx").open().cursor@mv(L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT;L_SHIPDATE>=A4 && L_SHIPDATE<A3) |
6 |
=A5.groups@u(L_SUPPKEY:supplier_no;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):total_revenue) |
7 |
=A6.maxp@a(total_revenue) |
8 |
=file("supplier_2.ctx").open() |
9 |
=A7.joinx@q(supplier_no,A8:S_SUPPKEY,S_NAME,S_ADDRESS,S_PHONE) |
10 |
=interval@ms(A1,now()) |
Test result:
Test items |
Execution time (seconds) |
General method |
6 |
Data conversion |
5 |
Column-wise computing |
2 |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version