# I、 Query Requirement

Q4 queries to get order priority statistics. Calculate the number of orders, among which at least one row of each order is received by the customer after its submission date, within a given three-month period.

The characteristic of Q4 is single table query operation with grouping, sorting, aggregation and sub-query. Subqueries are correlated subqueries.

# II、 Oracle Execution

The query SQL written in Oracle is as follows:

select  /*+ parallel(n) */

o_orderpriority,

count(*) as order_count

from

orders

where

o_orderdate >= date '1995-10-01'

and o_orderdate < date '1995-10-01' + interval '3' month

and exists (

select * from lineitem

where l_orderkey = o_orderkey and l_commitdate < l_receiptdate

)

group by

o_orderpriority

order by

o_orderpriority;

/*+ parallel(n) */ is the parallel query syntax of Oracle, and n is the parallel number.

Script execution time, Unit: seconds

 Number of parallel 1 2 4 8 12 Oracle 504 311 270 189 207

# III、 SPL Optimization

Analyze the SQL statement in which there is an EXISTS sub-query, find out the records of the lineitem sub-table satisfying the condition l_commitdate < l_receiptdate, and the corresponding records of the orders primary table.

All the EXISTS clause can be rewritten as JOIN. In fact, SQL usually converts EXISTS to JOIN to implement. Otherwise, the calculation logic of EXISTS will result in the complexity of N*M (N and M are the record numbers of primary-sub tables respectively), whose performance is totally unacceptable.

Rewriting EXISTS to JOIN is as follows:

select  /*+ parallel(n) */

o_orderpriority,

count(*) as order_count

from (

select distinct l_orderkey,o_orderpriority from

orders,lineitem

where

o_orderdate >= date '1995-10-01'

and o_orderdate < date '1995-10-01' + interval '3' month

and l_orderkey = o_orderkey

and l_commitdate < l_receiptdate

)

group by

o_orderpriority

order by

o_orderpriority;

After rewriting, we can find that this is still a question of primary-sub table joining, and the JOIN operation here is to generate certain filtering condition for the primary table. The intermediate result is actually corresponding to the orders table, while the result of primary-sub table joining generally corresponds to the sub-table, therefore, DISTINCT operation should be done on the join result for the orderkey.

On one hand, SPL can adopt the method of ordered merging mentioned in Q3 to implement JOIN; on the other hand, it can also use the orderliness of the result set (the join result set is ordered to the orderkey) to quickly achieve the DISTINCT operation.

The SPL script is as follows

 A 1 =now() 2 1995-10-01 3 =elapse@m(A2,3) 4 =file("orders.ctx").open().cursor@m(O_ORDERKEY,O_ORDERPRIORITY;O_ORDERDATE>=A2   && O_ORDERDATE

A4 and A5 use the technique explained before. group@1 in A6 implies DISTINCT for ordered cursors (only the first record of each group is selected).

Script execution time, Unit: seconds

 Number of parallel 1 2 4 8 12 Oracle 504 311 270 189 207 SPL composite table 160 81 41 24 20

Parallel performance is still excellent.

# IV、 Further Optimization

As mentioned in the above analysis, the JOIN operation here is essentially to filter the primary table. We can also optimize it as follows: the sub-table is regarded as the set field of the primary table, and aggregate values are calculated directly and used as filtering conditions. Here we can count the number of records which satisfy the condition L_COMMITDATE < L_RECEIPTDATE under the same L_ORDERKEY. If the number equals to 0, the corresponding records of the primary table will be filtered out. In this manner, the record number of the join result is greatly reduced by changing the operation of group & DISTINCT to a more efficient filtering operation, which further improves the performance.

The SPL script is as follows

 A 1 =now() 2 1995-10-01 3 =elapse@m(A2,3) 4 =file("orders.ctx").open().cursor@m(O_ORDERKEY,O_ORDERPRIORITY;O_ORDERDATE>=A2   && O_ORDERDATE0) 6 =A5.groups(O_ORDERPRIORITY;count(1):order_count) 7 return interval@ms(A1,now())

In the new function of A5, the record number of the sub-table satisfying the conditions is calculated, and the record with a result of more than 0 is filtered and retained.

Script execution time, Unit: seconds

 Number of parallel 1 2 4 8 12 Oracle 504 311 270 189 207 SPL composite table 160 81 41 24 20 After further optimization 153 78 39 21 15

Similarly, this operation is easy to parallelize as well.