Learn performance optimization skills from TPCH tests – Q4

 

I、 Query Requirement

Q4 statement query to get order priority statistics. Calculate the number of orders for a given three-month period, with at least one line in each order received by the customer after its submission date.            

The characteristics of Q4 statement are: single table query operation with grouping, sorting, aggregation and sub-query. Subqueries are related subqueries.

 

II、 Oracle Implementation

The query SQL statements written by Oracle are as follows:

 

select  /*+ parallel(n) */

o_orderpriority,

count(*) as order_count

from

orders

where

o_orderdate >= date '1995-10-01'

and o_orderdate < date '1995-10-01' + interval '3' month

and exists (

select * from lineitem

where l_orderkey = o_orderkey and l_commitdate < l_receiptdate

)

group by

o_orderpriority

order by

o_orderpriority;

Where /*+ parallel(n) */ is Oracle's parallel query syntax, and n is the parallel number.

Script execution time, Unit: seconds

Number of parallel

1

2

4

8

12

Oracle

504

311

270

189

207

III、 SPL optimization

Analyzing the SQL statement, it has a exists sub-query, finding out the records of the sub-table lineitem satisfying the condition l_commitdate < l_receiptdate ,  and the corresponding records of the main table orders.

The exists clause can be rewritten to join. In fact, SQL usually converts exists into join when it is implemented. Otherwise, if the exists computing logic is followed, the complexity of N*M (N and M are the number of records of main and sub-table respectively) will be caused. This performance is totally unacceptable.

After rewriting exists to join, it’s like this:

 

select  /*+ parallel(n) */

o_orderpriority,

count(*) as order_count

from (

select distinct l_orderkey,o_orderpriority from

orders,lineitem

where

o_orderdate >= date '1995-10-01'

and o_orderdate < date '1995-10-01' + interval '3' month

and l_orderkey = o_orderkey

and l_commitdate < l_receiptdate

)

group by

o_orderpriority

order by

o_orderpriority;

 

After rewriting, it can be found that this is still a problem of main and sub-table joining, and the join operation here is to generate some filtering condition for the main table. The intermediate result set is actually corresponding to the orders table, while the general result set of main and sub-table join corresponds to the sub-table, so the join result should be de-duplicated for the orderkey.

On the one hand, SPL can use the ordered merging method mentioned in Q3 to realize the join, on the other hand, it can also use the orderliness of the result set (the join result set is ordered to the orderkey) to quickly realize the de-duplication operation.

 

SPL script is as follows


A

1

=1

2

=now()

3

>date=date("1995-10-01")

4

=elapse@m(date,3)

5

=file(path+"orders.ctx").create().cursor@m(O_ORDERKEY,O_ORDERPRIORITY;O_ORDERDATE>=date && O_ORDERDATE < A4;A1)

6

=file(path+"lineitem.ctx").create().news(A5,L_ORDERKEY,L_COMMITDATE,L_RECEIPTDATE,O_ORDERPRIORITY; L_COMMITDATE < L_RECEIPTDATE)

7

=A6.group@1(L_ORDERKEY).groups(O_ORDERPRIORITY;count(1):order_count)

8

=now()

9

=interval@s(A2,A8)

A5 and A6 used the technique explained before. Group@1 in A7 implies de-duplication for ordered cursors (only the first record is taken after grouping).

 

Script execution time, Unit: seconds

Number of parallel

1

2

4

8

12

Oracle

504

311

270

189

207

SPL group table

160

81

41

24

20

Parallel performance is still excellent.

 

IV、 Further optimization

As mentioned in the above analysis, the join operation here is essentially to filter the main table. We can also optimize it according to the idea that the sub-table is regarded as the set field of the main table, and aggregate values are calculated directly and used as filtering conditions. Here we can count the records of L_COMMITDATE < L_RECEIPTDATE under the same L_ORDERKEY. If equal to 0, we can filter out the corresponding records of the main table. This will greatly reduce the number of records of the join result set, and change the operation of group de-duplication into a more efficient filtering operation, and further improve the performance.

 

SPL script is as follows


A

1

=1

2

=now()

3

>date=date("1995-10-01")

4

=elapse@m(date,3)

5

=file(path+"orders.ctx").create().cursor@m(O_ORDERKEY,O_ORDERPRIORITY;O_ORDERDATE>=date && O_ORDERDATE < A4;A1)

6

=file(path+"lineitem.ctx").create().new(A5,O_ORDERPRIORITY,count():c;L_COMMITDATE < L_RECEIPTDATE).select(c>0)

7

=A6.groups(O_ORDERPRIORITY;count(1):order_count)

8

=now()

9

=interval@s(A2,A8)

In the new function of A6, the number of records in the sub-table satisfying the conditions is calculated, and the filter retains the record with the result more than 0.

Script execution time, Unit: seconds

Number of parallel

1

2

4

8

12

Oracle

504

311

270

189

207

SPL group table

160

81

41

24

20

After further optimization

153

78

39

21

15

Similarly, this operation is easy to parallel.