Learn performance optimization skills from TPCH tests – Q4

 

I、 Query Requirement

Q4 queries to get order priority statistics. Calculate the number of orders, among which at least one row of each order is received by the customer after its submission date, within a given three-month period.            

The characteristic of Q4 is single table query operation with grouping, sorting, aggregation and sub-query. Subqueries are correlated subqueries.

 

II、 Oracle Execution

The query SQL written in Oracle is as follows:

 select  /*+ parallel(n) */

         o_orderpriority,

         count(*) as order_count

from

         orders

where

         o_orderdate >= date '1995-10-01'

         and o_orderdate < date '1995-10-01' + interval '3' month

         and exists (

                   select * from lineitem

                   where l_orderkey = o_orderkey and l_commitdate < l_receiptdate

         )

group by

         o_orderpriority

order by

         o_orderpriority;

/*+ parallel(n) */ is the parallel query syntax of Oracle, and n is the parallel number.

Script execution time, Unit: seconds

Number of parallel

1

2

4

8

12

Oracle

504

311

270

189

207

III、 SPL Optimization

Analyze the SQL statement in which there is an EXISTS sub-query, find out the records of the lineitem sub-table satisfying the condition l_commitdate < l_receiptdate, and the corresponding records of the orders primary table.

All the EXISTS clause can be rewritten as JOIN. In fact, SQL usually converts EXISTS to JOIN to implement. Otherwise, the calculation logic of EXISTS will result in the complexity of N*M (N and M are the record numbers of primary-sub tables respectively), whose performance is totally unacceptable.

Rewriting EXISTS to JOIN is as follows:

 select  /*+ parallel(n) */

       o_orderpriority,

       count(*) as order_count

from (

       select distinct l_orderkey,o_orderpriority from

       orders,lineitem

       where

              o_orderdate >= date '1995-10-01'

              and o_orderdate < date '1995-10-01' + interval '3' month

              and l_orderkey = o_orderkey

              and l_commitdate < l_receiptdate

)

group by

       o_orderpriority

order by

       o_orderpriority;

After rewriting, we can find that this is still a question of primary-sub table joining, and the JOIN operation here is to generate certain filtering condition for the primary table. The intermediate result is actually corresponding to the orders table, while the result of primary-sub table joining generally corresponds to the sub-table, therefore, DISTINCT operation should be done on the join result for the orderkey.

On one hand, SPL can adopt the method of ordered merging mentioned in Q3 to implement JOIN; on the other hand, it can also use the orderliness of the result set (the join result set is ordered to the orderkey) to quickly achieve the DISTINCT operation.

 

The SPL script is as follows


A

1

=now()

2

1995-10-01

3

=elapse@m(A2,3)

4

=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_ORDERPRIORITY;O_ORDERDATE>=A2   && O_ORDERDATE<A3)

5

=file("lineitem.ctx").open().news(A4,L_ORDERKEY,O_ORDERPRIORITY;L_COMMITDATE<L_RECEIPTDATE)

6

=A5.group@1(L_ORDERKEY).groups(O_ORDERPRIORITY;count(1):order_count)

7

return interval@ms(A1,now())

 

A4 and A5 use the technique explained before. group@1 in A6 implies DISTINCT for ordered cursors (only the first record of each group is selected).

 

Script execution time, Unit: seconds

Number of parallel

1

2

4

8

12

Oracle

504

311

270

189

207

SPL composite table

160

81

41

24

20

Parallel performance is still excellent.

 

IV、 Further Optimization

As mentioned in the above analysis, the JOIN operation here is essentially to filter the primary table. We can also optimize it as follows: the sub-table is regarded as the set field of the primary table, and aggregate values are calculated directly and used as filtering conditions. Here we can count the number of records which satisfy the condition L_COMMITDATE < L_RECEIPTDATE under the same L_ORDERKEY. If the number equals to 0, the corresponding records of the primary table will be filtered out. In this manner, the record number of the join result is greatly reduced by changing the operation of group & DISTINCT to a more efficient filtering operation, which further improves the performance.

 

The SPL script is as follows


A

1

=now()

2

1995-10-01

3

=elapse@m(A2,3)

4

=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_ORDERPRIORITY;O_ORDERDATE>=A2   && O_ORDERDATE<A3)

5

=file("lineitem.ctx").open().new(A4,O_ORDERPRIORITY,count():c;L_COMMITDATE   < L_RECEIPTDATE).select(c>0)

6

=A5.groups(O_ORDERPRIORITY;count(1):order_count)

7

return interval@ms(A1,now())

In the new function of A5, the record number of the sub-table satisfying the conditions is calculated, and the record with a result of more than 0 is filtered and retained.

Script execution time, Unit: seconds

Number of parallel

1

2

4

8

12

Oracle

504

311

270

189

207

SPL composite table

160

81

41

24

20

After further optimization

153

78

39

21

15

Similarly, this operation is easy to parallelize as well.