Performance Optimization Exercises Using TPC-H – Q3
Ⅰ SQL code and analysis
Below is the SQL query statement:
select * from (
select
l_orderkey,
sum(l_extendedprice * (1 - l_discount)) as revenue,
o_orderdate,
o_shippriority
from
customer,
orders,
lineitem
where
c_mktsegment = 'BUILDING'
and c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate < date '1995-03-15'
and l_shipdate > date '1995-03-15'
group by
l_orderkey,
o_orderdate,
o_shippriority
order by
revenue desc,
o_orderdate
) where rownum<=10;
This is a typical primary-sub table association. On the association result, a grouping & aggregation operation is performed.
Ⅱ SPL solution
By sorting primary keys of both primary table and sub table, we can use the order-based merge algorithm to achieve association. The number of comparisons is much less and parallel computation becomes easy.
There are three fields in the GROUP BY clause - L_ORDERKEY, O_ORDERDATE and O_SHIPPRIORITY because they need to be exported to the result set. Actually, the last two fields can be determined by the first one. So, we can only group data by the first field and get values of the other two fields according to the first one, without the need to compare all the three fields.
A |
|
1 |
=now() |
2 |
1995-3-15 |
3 |
>mktsegment="BUILDING" |
4 |
=file("customer.ctx").open().cursor@m(C_CUSTKEY;C_MKTSEGMENT==mktsegment).fetch().keys@im(C_CUSTKEY) |
5 |
=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_ORDERDATE,O_SHIPPRIORITY;O_ORDERDATE<A2 && A4.find(O_CUSTKEY)) |
6 |
=file("lineitem.ctx").open().news@r(A5,O_ORDERKEY, sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue,O_ORDERDATE,O_SHIPPRIORITY;L_SHIPDATE>A2) |
7 |
=A6.total(top(10;-revenue,O_ORDERDATE)) |
8 |
=interval@ms(A1,now()) |
A6 uses the technique of writing filtering condition in the cursor at its creation. So does A5, where A4.find mean filtering away records that cannot match the foreign key directly rather than converting them to pointers.
In A6’s order-based merge of the primary and sub tables, we use news method to skip key values that are filtered away from the primary table when retrieving the sub table, reducing the amount of retrieved sub table data. The @r option enables grouping and summarizing sub table records by primary table’s primary key.
Note that lineitem table must be stored in segments matching with corresponding segments of orders table – as @p option is used to create the composite table lineitem – in order to avoid mismatching during parallel segmentation.
Ⅲ Further optimization
1. Optimization method
This example will use several optimization methods, including dimension table preloading (load customer table) and dimension table primary key numberization (numberize customer table’s C_CUSTKEY and orders table’s O_CUSTKEY) explained in essay Q2, and string-integer conversion and date-integer conversion techniques explained in essay Q1. The lineitem table’s L_SHIPDATE field is already converted in Q1.
2. Code for data conversion
2.1 Conversion on customer table
A |
|
1 |
=file("customer.ctx").open().import() |
2 |
=A1.id(C_MKTSEGMENT).sort() |
3 |
=file("c_mktsegment.txt").export(A2) |
4 |
=A1.run(C_CUSTKEY=#, C_MKTSEGMENT=A2.pos@b(C_MKTSEGMENT)) |
5 |
=file("customer_3.ctx").create(#C_CUSTKEY, C_NAME, C_ADDRESS, C_NATIONKEY, C_PHONE, C_ACCTBAL, C_MKTSEGMENT, C_COMMENT) |
6 |
>A5.append(A4.cursor()) |
A2/A3 Convert string values of C_MKTSEGMENT field into integers.
A4 Set values of primary key C_CUSTKEY as corresponding row numbers, and values of C_MKTSEGMENT as corresponding ordinal numbers in list table of distinct values.
2.2 Conversion on orders table
A |
|
1 |
=file("customer.ctx").open().import(C_CUSTKEY).keys@im(C_CUSTKEY) |
2 |
=file("orders.ctx").open().cursor() |
3 |
=A2.run(O_CUSTKEY=A1.pfind(O_CUSTKEY),O_ORDERDATE=days@o(O_ORDERDATE)) |
4 |
=file("orders_3.ctx").create(#O_ORDERKEY,O_CUSTKEY,O_ORDERSTATUS,O_TOTALPRICE, O_ORDERDATE,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_COMMENT) |
5 |
>A4.append(A3) |
A1 Retrieve customer table’s C_CUSTKEY column, set it as primary key and create index on it.
A3 Get row numbers of records whose primary key values are O_CUSTKEY and assign them to O_CUSTKEY. Then convert O_ORDERDATE field values to small integers.
2.3 Conversion on lineitem table
Use the already converted lineitem_1.ctx in Q1 and rename it lineitem_3.ctx.
3. Code after data conversion
The dimension tables and the value list of string fields need to be preloaded. Below is preloading code:
A |
|
1 |
>env(customer, file("customer_3.ctx").open().import()) |
2 |
>env(c_mktsegment,file("c_mktsegment.txt").import@si()) |
Before performing the query, we need to first run the preloading code to load the small dimension tables into memory.
Computing code:
A |
|
1 |
=now() |
2 |
=days@o(date("1995-03-15")) |
3 |
>mktsegment=c_mktsegment.pos@b("BUILDING") |
4 |
=customer.@m(C_MKTSEGMENT==mktsegment) |
5 |
=file("orders_3.ctx").open().cursor@m(O_ORDERKEY,O_ORDERDATE,O_SHIPPRIORITY;O_ORDERDATE<A2 && A4(O_CUSTKEY)) |
6 |
=file("lineitem_3.ctx").open().news@r(A5,O_ORDERKEYsum(,L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue,O_ORDERDATE,O_SHIPPRIORITY;L_SHIPDATE>A2) |
7 |
=A6.total(top(10;-revenue,O_ORDERDATE)) |
8 |
=A7.run(O_ORDERDATE=date@o(O_ORDERDATE)) |
9 |
return interval@ms(A1,now()) |
A4 Compare the numberized primary key with an alignment sequence. The optimization principle is explained in Q2.
A8 Use date@o to restore date-converted integers back to the original dates.
Ⅳ Using enterprise edition’s column-wise computation
1. Original data
A |
|
1 |
=now() |
2 |
1995-3-15 |
3 |
>mktsegment="BUILDING" |
4 |
=file("customer.ctx").open().cursor@mv(C_CUSTKEY;C_MKTSEGMENT==mktsegment).fetch().keys@im(C_CUSTKEY) |
5 |
=file("orders.ctx").open().cursor@mv(O_ORDERKEY,O_ORDERDATE,O_SHIPPRIORITY;A4.find(O_CUSTKEY) && O_ORDERDATE<A2) |
6 |
=file("lineitem.ctx").open().news@r(A5,O_ORDERKEY,sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue,O_ORDERDATE,O_SHIPPRIORITY;L_SHIPDATE>A2) |
7 |
=A6.total(top(10;-revenue,O_ORDERDATE)) |
8 |
=interval@ms(A1,now()) |
The cursor()function in A4 and A5 works with @v option to enable retrieving data using column-wise cursor. A6 uses news() function to create a column-wise, synchronously grouped cursor according to A5.
2. Optimized data
The dimension tables and the value list of string fields need to be preloaded. Below is preloading code:
A |
|
1 |
>env(customer, file("customer_3.ctx").open().import@v()) |
2 |
>env(c_mktsegment,file("c_mktsegment.txt").import@si()) |
The @v option is needed to preload each dimension table as a column-wise table sequence.
Before performing the query, we need to first run the preloading code to load the small dimension tables into memory.
Computing code:
A |
|
1 |
=now() |
2 |
=days@o(date("1995-03-15")) |
3 |
>mktsegment=c_mktsegment.pos@b("BUILDING") |
4 |
=customer.(C_MKTSEGMENT==mktsegment) |
5 |
=file("orders_3.ctx").open().cursor@mv(O_ORDERKEY,O_ORDERDATE,O_SHIPPRIORITY;A4(O_CUSTKEY) && O_ORDERDATE<A2) |
6 |
=file("lineitem_3.ctx").open().news@r(A5,O_ORDERKEY,sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue,O_ORDERDATE,O_SHIPPRIORITY;L_SHIPDATE>A2) |
7 |
=A6.total(top(10;-revenue,O_ORDERDATE)) |
8 |
=A7.new(O_ORDERKEY,revenue,date@o(O_ORDERDATE):O_ORDERDATE,O_SHIPPRIORITY) |
9 |
=interval@ms(A1,now()) |
A8 restores date-converted integers back to the original dates. Generally, run()function is used to reset values of O_ORDERDATE field. But as column data type cannot be changed under the column-wise scheme, we use new() function to perform the operation.
Ⅴ Test result
Unit: Second
Regular |
Column-wise |
|
Before optimization |
12.6 |
8.4 |
After optimization |
8.9 |
5.8 |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL