Performance Optimization Exercises Using TPC-H – Q18
Ⅰ SQL code and analysis
Below is the SQL query statement:
select * from (
select
c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice,
sum(l_quantity)
from
customer, orders, lineitem
where
o_orderkey in (
select
l_orderkey
from
lineitem
group by
l_orderkey
having
sum(l_quantity) > 314
)
and c_custkey = o_custkey
and o_orderkey = l_orderkey
group by
c_name,
c_custkey,
o_orderkey,
o_orderdate,
o_totalprice
order by
o_totalprice desc,
o_orderdate
) where rownum<=100;
Let’s examine the SQL statement. If we name the following subquery
select
l_orderkey, sum(l_quantity) lq
from
lineitem
group by
l_orderkey
View lo, the body of the original query statement can be written as:
select * from (
select
c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, lq
from
customer, orders, lo
where
c_custkey = o_custkey
and o_orderkey = l_orderkey
and lq > 314
order by
o_totalprice desc,
o_orderdate
) where rownum<=100;
Now the computation on the view is a single-table grouping & aggregation operation. The main query is a multi-table association and filtering.
Ⅱ SPL solution
This is a join between orders table having a foreign key association and its homo-dimension table lo. The lineitem table, as orders table’s sub table, is also ordered by l_orderkey, and view lo, which is computed from lineitem table, should be still ordered by l_orderkey. So, we can perform a fast merge join between orders and lo.
Rather than making customer table participate in the computation from the beginning, we can associate the foreign key table with orders table based on the result set.
A |
|
1 |
=now() |
2 |
>quantity=314 |
3 |
=file("lineitem.ctx").open().cursor@m(L_ORDERKEY,L_QUANTITY) |
4 |
=A3.group@s(L_ORDERKEY;sum(L_QUANTITY):quantities).select(quantities>quantity) |
5 |
=file("orders.ctx").open().new(A4,O_ORDERKEY,O_CUSTKEY,O_TOTALPRICE,O_ORDERDATE,quantities) |
6 |
=A5.total(top(100;-O_TOTALPRICE,O_ORDERDATE)) |
7 |
=file("customer.ctx").open() |
8 |
=A6.joinx@q(O_CUSTKEY,A7:C_CUSTKEY,C_NAME) |
9 |
return interval@ms(A1,now()) |
A5 uses the order-based primary-sub table merge technique explained in the previous examples. A8 uses joinx@q to associate orders table and customer table in order.
Ⅲ Further optimization
1. Optimization method
In this example, we will use the dimension table primary key numberization method explained in Q2 – customer table’s C_CUSTKEY field and orders table’s O_CUSTKEY field have been converted in the previous examples.
2. Code for data conversion
Copy customer_13.ctx, orders_13.ctx and lineitem_17.ctx and rename them customer_18.ctx, orders_18.ctx and lineitem_18.ctx respectively.
3. Code after data conversion
First, we need to preload the dimension tables. Below is preloading code:
A |
|
1 |
>env(customer, file("customer_18.ctx").open().import()) |
Before performing the query, we need to first run the preloading code to load the small dimension tables into memory.
Computing code:
A |
|
1 |
=now() |
2 |
>quantity=314 |
3 |
=file("lineitem_18.ctx").open().cursor@m(L_ORDERKEY,L_QUANTITY) |
4 |
=A3.group@s(L_ORDERKEY;sum(L_QUANTITY):quantities).select(quantities>quantity) |
5 |
=file("orders_18.ctx").open().new(A4,O_ORDERKEY,O_CUSTKEY,O_TOTALPRICE,O_ORDERDATE,quantities) |
6 |
=A5.total(top(100;-O_TOTALPRICE,O_ORDERDATE)) |
7 |
O_ORDERDATE=date@o(O_ORDERDATE))=A6.derive(customer(O_CUSTKEY).C_NAME:C_NAME).run( |
8 |
=interval@ms(A1,now()) |
The O_ORDERDATE field in orders_18.ctx has been converted to an integer type one in the previous example, so we need to use date@o to restore it to a date type one.
Ⅳ Using enterprise edition’s column-wise computation
1. Original data
A |
|
1 |
=now() |
2 |
>quantity=314 |
3 |
=file("lineitem.ctx").open().cursor@mv(L_ORDERKEY,L_QUANTITY) |
4 |
=A3.group@s(L_ORDERKEY;sum(L_QUANTITY):quantities).select@v(quantities>quantity) |
5 |
=file("orders.ctx").open().new(A4,O_ORDERKEY,O_CUSTKEY,O_TOTALPRICE,O_ORDERDATE,quantities) |
6 |
=A5.total(top(100;-O_TOTALPRICE,O_ORDERDATE)) |
7 |
=file("customer.ctx").open() |
8 |
=A6.joinx@q(O_CUSTKEY,A7:C_CUSTKEY,C_NAME) |
9 |
=interval@ms(A1,now()) |
2. Optimized data
First, we need to preload the dimension tables. Below is preloading code:
A |
|
1 |
>env(customer, file("customer_18.ctx").open().import@v()) |
Before performing the query, we need to first run the preloading code to load the small dimension tables into memory.
Computing code:
A |
|
1 |
=now() |
2 |
>quantity=314 |
3 |
=file("lineitem_18.ctx").open().cursor@mv(L_ORDERKEY,L_QUANTITY) |
4 |
=A3.group@s(L_ORDERKEY;sum(L_QUANTITY):quantities).select@v(quantities>quantity) |
5 |
=file("orders_18.ctx").open().new(A4,O_ORDERKEY,O_CUSTKEY,O_TOTALPRICE,O_ORDERDATE,quantities) |
6 |
=A5.total(top(100;-O_TOTALPRICE,O_ORDERDATE)) |
7 |
=A6.derive(customer(O_CUSTKEY).C_NAME:C_NAME).run(O_ORDERDATE=date@o(O_ORDERDATE)) |
8 |
=interval@ms(A1,now()) |
Ⅴ Test result
Unit: Second
Regular |
Column-wise |
|
Before optimization |
13.7 |
5.0 |
After optimization |
13.7 |
4.9 |
The data conversion and preloading does not bring noticeable performance increase because the main computing time is spent in A3-A6, which the optimization methods do not work. The result set on which the optimization methods are employed has only 100 rows of data, which are too few to have a noticeable effect.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL