Performance Optimization Exercises Using TPC-H – Q5
Ⅰ SQL code and analysis
Below is the SQL query statement:
select
n_name,
sum(l_extendedprice * (1 - l_discount)) as revenue
from
customer,
orders,
lineitem,
supplier,
nation,
region
where
c_custkey = o_custkey
and l_orderkey = o_orderkey
and l_suppkey = s_suppkey
and c_nationkey = s_nationkey
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'ASIA'
and o_orderdate >= date '1995-01-01'
and o_orderdate < date '1995-01-01' + interval '1' year
group by
n_name
order by
revenue desc;
This is a grouping & aggregation operation on the filtered result set of multi-table association that contains a primary-sub table association between orders table and lineitem table.
Ⅱ SPL solution
The working principle is similar to that in Q3.
A |
|
1 |
=now() |
2 |
1995-1-1 |
3 |
=elapse@y(A2,1) |
4 |
>name="ASIA" |
5 |
=file("region.btx").import@b().select(R_NAME==name) |
6 |
=file("nation.btx").import@b().switch@i(N_REGIONKEY,A5:R_REGIONKEY).keys(N_NATIONKEY) |
7 |
=file("supplier.ctx").open().cursor@m(S_SUPPKEY,S_NATIONKEY;A6.find(S_NATIONKEY)).fetch().keys@im(S_SUPPKEY) |
8 |
=file("customer.ctx").open().cursor@m(C_CUSTKEY,C_NATIONKEY;A6.find(C_NATIONKEY)).fetch().keys@im(C_CUSTKEY) |
9 |
=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A2 && O_ORDERDATE < A3,O_CUSTKEY:A8) |
10 |
=file("lineitem.ctx").open().news(A9,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_CUSTKEY;L_SUPPKEY:A7) |
11 |
=A10.select(O_CUSTKEY.C_NATIONKEY==L_SUPPKEY.S_NATIONKEY) |
12 |
=A11.groups@u(L_SUPPKEY.S_NATIONKEY:N_NAME;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue) |
13 |
=A12.sort@z(revenue).run(N_NAME=A6.find(N_NAME).N_NAME) |
14 |
=interval@ms(A1,now()) |
Here we use the techniques of performing filtering at cursor creation, order-based primary-sub table association, and converting join field to foreign key table pointers.
Unlike Q3, the grouping field isn’t the ordered L_ORDERKEY and group@s() function becomes unsuitable. A12 performs grouping on S_NATIONKEY instead of N_NAME because integer comparison is faster than string comparison, and replace the field with N_NAME after the grouping operation is done.
Ⅲ Further optimization
1. Optimization method
In this example, we will use the date-integer conversion optimization method explained in Q1 – the orders table’s OR_ORDERDATE field is already converted in Q3 – and the dimension table primary key numberization – region table’s R_REGIONKEY field, nation table’s N_NATIONKEY and N_REGIONKEY fields, and supplier table’s S_SUPPKEY and S_NATIONKEY fields are already converted in Q2; and orders table’s O_CUSTKEY is already converted in Q3. Here we only need to convert customer table’s C_NATIONKEY field and lineitem table’s L_SUPPKEY field.
2. Code for data conversion
2.1 Conversion on region table, nation table and supplier table
Copy region_2.btx, nation_2.btx and supplier_2.ctx, and rename them region_5.btx, nation_5.btx and supplier_5.ctx.
2.2 Conversion on orders table
Copy orders_4.ctx and name it orders_5.ctx.
2.3 Conversion on customer table
A |
|
1 |
=file("customer_3.ctx").open().import() |
2 |
=file("nation.btx").import@b() |
3 |
=A1.run(C_NATIONKEY=A2.pselect(N_NATIONKEY==C_NATIONKEY)) |
4 |
=file("customer_5.ctx").create(#C_CUSTKEY, C_NAME, C_ADDRESS, C_NATIONKEY, C_PHONE, C_ACCTBAL, C_MKTSEGMENT, C_COMMENT) |
>A4.append(A3) |
2.4 Conversion on lineitem table
A |
|
1 |
=file("supplier.ctx").open().import(S_SUPPKEY).keys@im(S_SUPPKEY) |
2 |
=file("lineitem_4.ctx").open().cursor() |
3 |
=A2.run(L_SUPPKEY=A1.pfind(L_SUPPKEY)) |
4 |
=file("lineitem_5.ctx").create@p(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE,L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT) |
>A4.append(A3) |
3. Code after data conversion
The dimension tables and the value list of string fields need to be preloaded. Below is preloading code:
A |
|
1 |
>env(region, file(“region_5.btx”).import@b()) |
2 |
>env(nation, file("nation_5.btx").import@b()) |
3 |
>env(customer, file("customer_5.ctx").open().import()) |
4 |
>env(supplier, file("supplier_5.ctx").open().import()) |
Before performing the query, we need to first run the preloading code to load the small dimension tables into memory.
Computing code:
A |
|
1 |
=now() |
2 |
1995-1-1 |
3 |
=days@o(elapse@y(A2,1)) |
4 |
=days@o(A2) |
5 |
>name="ASIA" |
6 |
=nation.(region(N_REGIONKEY).R_NAME==name) |
7 |
=supplier.@m(if(A6(S_NATIONKEY), S_NATIONKEY,null)) |
8 |
=customer.@m(if(A6(C_NATIONKEY), C_NATIONKEY,null)) |
9 |
=file("orders_5.ctx").open().cursor@m(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A4 && O_ORDERDATE<A3,O_CUSTKEY:A8:#) |
10 |
=file("lineitem_5.ctx").open().news(A9,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_CUSTKEY;L_SUPPKEY:A7:#) |
11 |
=A10.select(O_CUSTKEY ==L_SUPPKEY) |
12 |
=A11.groups@u(L_SUPPKEY:N_NAME;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue) |
13 |
=A12.sort@z(revenue).run(N_NAME=nation(N_NAME).N_NAME) |
14 |
=interval@ms(A1,now()) |
We use two optimization techniques in the code – the alignment sequence after dimension table primary key numberization and association with dimension table row numbers.
Ⅳ Using enterprise edition’s column-wise computation
1. Original data
A |
|
1 |
=now() |
2 |
1995-1-1 |
3 |
=elapse@y(A2,1) |
4 |
>name="ASIA" |
5 |
=file("region.btx").import@b().select@1(R_NAME==name).R_REGIONKEY |
6 |
=file("nation.btx").import@b().select(N_REGIONKEY==A5).derive@o().keys@i(N_NATIONKEY) |
7 |
=file("supplier.ctx").open().cursor@mv(S_SUPPKEY,S_NATIONKEY;A6.find(S_NATIONKEY)).fetch().keys@im(S_SUPPKEY) |
8 |
=file("customer.ctx").open().cursor@mv(C_CUSTKEY,C_NATIONKEY;A6.find(C_NATIONKEY)).fetch().keys@im(C_CUSTKEY) |
9 |
=file("orders.ctx").open().cursor@mv(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A2 && O_ORDERDATE<A3,O_CUSTKEY:A8) |
10 |
=file("lineitem.ctx").open().news(A9,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_CUSTKEY;L_SUPPKEY:A7) |
11 |
=A10.select@v(O_CUSTKEY.C_NATIONKEY==L_SUPPKEY.S_NATIONKEY) |
12 |
=A11.groups@u(L_SUPPKEY.S_NATIONKEY;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue) |
13 |
=A12.sort@z(revenue).new(A6.find(S_NATIONKEY).N_NAME,revenue) |
14 |
=interval@ms(A1,now()) |
A11 uses @v option in select() function that performs filtering on the column-wise cursor to keep the filtered cursor still as a column-wise cursor.
2. Optimized data
The dimension tables and the value list of string fields need to be preloaded. Below is preloading code:
A |
|
1 |
>env(region, file(“region_5.btx”).import@bv()) |
2 |
>env(nation, file("nation_5.btx").import@bv()) |
3 |
>env(customer, file("customer_5.ctx").open().import@v()) |
4 |
>env(supplier, file("supplier_5.ctx").open().import@v()) |
Before performing the query, we need to first run the preloading code to load the small dimension tables into memory.
Computing code:
A |
|
1 |
=now() |
2 |
1995-1-1 |
3 |
=days@o(elapse@y(A2,1)) |
4 |
=days@o(A2) |
5 |
>name="ASIA" |
6 |
=nation.(region(N_REGIONKEY).R_NAME==name) |
7 |
=supplier.(A6(S_NATIONKEY)) |
8 |
=customer.(A6(C_NATIONKEY)) |
9 |
=file("orders_5.ctx").open().cursor@mv(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A4 && O_ORDERDATE<A3 && A8(O_CUSTKEY)) |
10 |
=file("lineitem_5.ctx").open().news(A9,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_CUSTKEY;A7(L_SUPPKEY)) |
11 |
=A10.derive@o(supplier(L_SUPPKEY).S_NATIONKEY,customer(O_CUSTKEY).C_NATIONKEY).select@v(C_NATIONKEY==S_NATIONKEY) |
12 |
=A11.groups@u(S_NATIONKEY;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue) |
13 |
=A12.sort@z(revenue).new(nation(S_NATIONKEY).N_NAME,revenue) |
14 |
=interval@ms(A1,now()) |
A9 uses A8(O_CUSTKEY) to only filter data at cursor creation without performing switch association. The reason is explained in Q2, and same operation for A10. A11 uses derive@o() to add two new columns and select@v to keep result as a column-wise cursor.
Ⅴ Test result
Unit: Second
Regular |
Column-wise |
|
Before optimization |
23.5 |
9.8 |
After optimization |
11.5 |
5.7 |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL