Performance Optimization Exercises Using TPC-H – Q4

 

SQL code and analysis

Below is the SQL query statement:

select
    o_orderpriority,
    count(*) as order_count
from
    orders
where
    o_orderdate >= date '1995-10-01'
    and o_orderdate < date '1995-10-01' + interval '3' month
    and exists (
        select * from lineitem
        where l_orderkey = o_orderkey and l_commitdate < l_receiptdate
    )
group by
    o_orderpriority
order by
    o_orderpriority;

This code contains an exists clause for finding records meeting l_commitdate < l_receiptdate from the sub table lineitem. We can always rewrite the exists clause as a join query, as shown below:

select
    o_orderpriority,
    count(*) as order_count
from (
    select 
        distinct l_orderkey,o_orderpriority
    from orders
    join lineitem on 
        l_orderkey = o_orderkey 
    where
        o_orderdate >= date '1995-10-01'
        and o_orderdate < date '1995-10-01' + interval '3' month
        and l_commitdate < l_receiptdate
)
group by
    o_orderpriority
order by
    o_orderpriority;

The rewritten statement still focuses on the primary-sub table association. And the association operation aims to generate a filtering condition on the primary table. The intermediate result set, actually, corresponds to orders table, so we need to perform distinct operation on the association result according to orderkey.

SPL solution

Perform association using the order-based merge explained in essay Q3, and get the filtering condition on the primary table:


A

1

=now()

2

1995-10-01

3

=elapse@m(A2,3)

4

=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_ORDERPRIORITY;O_ORDERDATE>=A2 && O_ORDERDATE<A3)

5

=file("lineitem.ctx").open().news@r(A4,O_ORDERPRIORITY;L_COMMITDATE<L_RECEIPTDATE)

6

=A5.groups(O_ORDERPRIORITY;count(1):order_count)

7

=interval@ms(A1,now())

A4 and A5 use the above-mentioned technique. A5 uses news@r() to group lineitem table by L_ORDERDEY while filtering away groups that do not meet L_COMMITDATE<L_RECEIPTDATE and retaining one record in each group.

A6 groups A5’s result by O_ORDERPRIORITY and counts records in each group.

Further optimization

1. Optimization method

In this example, we need to use the “date-integer conversion” optimization method explained in essay Q1. The orders table’s O_ORDERDATE field is already converted in essay Q3. Here we just need to convert lineitem table’s L_COMMITDATE field and L_RECEIPTDATE field.

Another optimization method involved in the example is string-integer conversion, which will convert values of O_ORDERPRIORITY into integers.

2. Code for data conversion

2.1 Conversion on orders table


A

1

=file("orders.ctx").open().cursor(O_ORDERPRIORITY)

2

=A1.id(O_ORDERPRIORITY).sort()

3

=file("o_orderpriority.txt").export(A2)

4

=file("orders_3.ctx").open().cursor()

5

=A4.run(O_ORDERPRIORITY=A2.pos@b(O_ORDERPRIORITY))

6

=file("orders_4.ctx").create(#O_ORDERKEY,O_CUSTKEY,O_ORDERSTATUS,O_TOTALPRICE, O_ORDERDATE,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_COMMENT)

7

>A6.append(A5)

2.2 Conversion on lineitem table


A

1

=file("lineitem_3.ctx").open().cursor()

2

=A1.run(L_COMMITDATE=days@o(L_COMMITDATE), L_RECEIPTDATE=days@o(L_RECEIPTDATE))

3

=file("lineitem_4.ctx").create@p(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE,L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT)

4

>A3.append(A2)

A2 converts values of L_COMMITDATE field and L_RECEIPTDATE field into small integers.

3. Code after data conversion


A

1

>o_orderpriority=file("o_orderpriority.txt").import@si()

2

=now()

3

1995-10-01

4

=days@o(elapse@m(A3,3))

5

=days@o(A3)

6

=file("orders_4.ctx").open().cursor@m(O_ORDERKEY,O_ORDERPRIORITY;O_ORDERDATE>=A5 && O_ORDERDATE<A4)

7

=file("lineitem_4.ctx").open().news@r(A6,O_ORDERPRIORITY;L_COMMITDATE<L_RECEIPTDATE)

8

=A7.groups(O_ORDERPRIORITY;count(1):order_count)

9

=A8.run(O_ORDERPRIORITY=o_orderpriority(O_ORDERPRIORITY))

10

=interval@ms(A2,now())

A9 restores the integer type O_ORDERPRIORITY field to the original string type one.

Using enterprise edition’s column-wise computation

1. Original data


A

1

=now()

2

1995-10-01

3

=elapse@m(A2,3)

4

=file("orders.ctx").open().cursor@mv(O_ORDERKEY,O_ORDERPRIORITY;O_ORDERDATE>=A2 && O_ORDERDATE<A3)

5

=file("lineitem.ctx").open().news@r(A4,O_ORDERPRIORITY;L_COMMITDATE<L_RECEIPTDATE)

6

=A5.groups(O_ORDERPRIORITY;count(1):order_count)

7

=interval@ms(A1,now())

The cursor()function in A4 works with @v option to enable retrieving data using column-wise cursor. A5 uses news() function to create a column-wise, synchronously grouped cursor according to A4.

2. Optimized data


A

1

>o_orderpriority=file("o_orderpriority.txt").import@si()

2

=now()

3

1995-10-01

4

=days@o(elapse@m(A3,3))

5

=days@o(A3)

6

=file("orders_4.ctx").open().cursor@mv(O_ORDERKEY,O_ORDERPRIORITY;O_ORDERDATE>=A5 && O_ORDERDATE<A4)

7

=file("lineitem_4.ctx").open().news@r(A6,O_ORDERPRIORITY;L_COMMITDATE<L_RECEIPTDATE)

8

=A7.groups(O_ORDERPRIORITY;count(1):order_count)

9

=A8.new(o_orderpriority(O_ORDERPRIORITY):O_ORDERPRIORITY,order_count)

10

=interval@ms(A2,now())

Test result

Unit: Second


Regular

Column-wise

Before optimization

6.5

5.2

After optimization

4.7

3.3