Learn performance optimization skills from TPCH tests – Q4
I、 Query Requirement
Q4 queries to get order priority statistics. Calculate the number of orders, among which at least one row of each order is received by the customer after its submission date, within a given three-month period.
The characteristic of Q4 is single table query operation with grouping, sorting, aggregation and sub-query. Subqueries are correlated subqueries.
II、 Oracle Execution
The query SQL written in Oracle is as follows:
select /*+ parallel(n) */
o_orderpriority,
count(*) as order_count
from
orders
where
o_orderdate >= date '1995-10-01'
and o_orderdate < date '1995-10-01' + interval '3' month
and exists (
select * from lineitem
where l_orderkey = o_orderkey and l_commitdate < l_receiptdate
)
group by
o_orderpriority
order by
o_orderpriority;
/*+ parallel(n) */ is the parallel query syntax of Oracle, and n is the parallel number.
Script execution time, Unit: seconds
Number of parallel |
1 |
2 |
4 |
8 |
12 |
Oracle |
504 |
311 |
270 |
189 |
207 |
III、 SPL Optimization
Analyze the SQL statement in which there is an EXISTS sub-query, find out the records of the lineitem sub-table satisfying the condition l_commitdate < l_receiptdate, and the corresponding records of the orders primary table.
All the EXISTS clause can be rewritten as JOIN. In fact, SQL usually converts EXISTS to JOIN to implement. Otherwise, the calculation logic of EXISTS will result in the complexity of N*M (N and M are the record numbers of primary-sub tables respectively), whose performance is totally unacceptable.
Rewriting EXISTS to JOIN is as follows:
select /*+ parallel(n) */
o_orderpriority,
count(*) as order_count
from (
select distinct l_orderkey,o_orderpriority from
orders,lineitem
where
o_orderdate >= date '1995-10-01'
and o_orderdate < date '1995-10-01' + interval '3' month
and l_orderkey = o_orderkey
and l_commitdate < l_receiptdate
)
group by
o_orderpriority
order by
o_orderpriority;
After rewriting, we can find that this is still a question of primary-sub table joining, and the JOIN operation here is to generate certain filtering condition for the primary table. The intermediate result is actually corresponding to the orders table, while the result of primary-sub table joining generally corresponds to the sub-table, therefore, DISTINCT operation should be done on the join result for the orderkey.
On one hand, SPL can adopt the method of ordered merging mentioned in Q3 to implement JOIN; on the other hand, it can also use the orderliness of the result set (the join result set is ordered to the orderkey) to quickly achieve the DISTINCT operation.
The SPL script is as follows:
A |
|
1 |
=now() |
2 |
1995-10-01 |
3 |
=elapse@m(A2,3) |
4 |
=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_ORDERPRIORITY;O_ORDERDATE>=A2 && O_ORDERDATE<A3) |
5 |
=file("lineitem.ctx").open().news(A4,L_ORDERKEY,O_ORDERPRIORITY;L_COMMITDATE<L_RECEIPTDATE) |
6 |
=A5.group@1(L_ORDERKEY).groups(O_ORDERPRIORITY;count(1):order_count) |
7 |
return interval@ms(A1,now()) |
A4 and A5 use the technique explained before. group@1 in A6 implies DISTINCT for ordered cursors (only the first record of each group is selected).
Script execution time, Unit: seconds
Number of parallel |
1 |
2 |
4 |
8 |
12 |
Oracle |
504 |
311 |
270 |
189 |
207 |
SPL composite table |
160 |
81 |
41 |
24 |
20 |
Parallel performance is still excellent.
IV、 Further Optimization
As mentioned in the above analysis, the JOIN operation here is essentially to filter the primary table. We can also optimize it as follows: the sub-table is regarded as the set field of the primary table, and aggregate values are calculated directly and used as filtering conditions. Here we can count the number of records which satisfy the condition L_COMMITDATE < L_RECEIPTDATE under the same L_ORDERKEY. If the number equals to 0, the corresponding records of the primary table will be filtered out. In this manner, the record number of the join result is greatly reduced by changing the operation of group & DISTINCT to a more efficient filtering operation, which further improves the performance.
The SPL script is as follows:
A |
|
1 |
=now() |
2 |
1995-10-01 |
3 |
=elapse@m(A2,3) |
4 |
=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_ORDERPRIORITY;O_ORDERDATE>=A2 && O_ORDERDATE<A3) |
5 |
=file("lineitem.ctx").open().new(A4,O_ORDERPRIORITY,count():c;L_COMMITDATE < L_RECEIPTDATE).select(c>0) |
6 |
=A5.groups(O_ORDERPRIORITY;count(1):order_count) |
7 |
return interval@ms(A1,now()) |
In the new function of A5, the record number of the sub-table satisfying the conditions is calculated, and the record with a result of more than 0 is filtered and retained.
Script execution time, Unit: seconds
Number of parallel |
1 |
2 |
4 |
8 |
12 |
Oracle |
504 |
311 |
270 |
189 |
207 |
SPL composite table |
160 |
81 |
41 |
24 |
20 |
After further optimization |
153 |
78 |
39 |
21 |
15 |
Similarly, this operation is easy to parallelize as well.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese Version