Performance Optimization Exercises Using TPC-H – Q18

 

SQL code and analysis

Below is the SQL query statement:

select * from (
    select
        c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice,
        sum(l_quantity)
    from
        customer, orders, lineitem
    where
        o_orderkey in (
            select
                l_orderkey
            from
                lineitem
            group by
                l_orderkey
            having
                sum(l_quantity) > 314
        )
        and c_custkey = o_custkey
        and o_orderkey = l_orderkey
    group by
        c_name,
        c_custkey,
        o_orderkey,
        o_orderdate,
        o_totalprice
    order by
        o_totalprice desc,
        o_orderdate
) where rownum<=100;

Let’s examine the SQL statement. If we name the following subquery

select 
    l_orderkey, sum(l_quantity) lq
from 
    lineitem
group by 
    l_orderkey
View lo, the body of the original query statement can be written as:
select * from (
    select
        c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, lq
    from
        customer, orders, lo
    where
        c_custkey = o_custkey
        and o_orderkey = l_orderkey
        and lq > 314
    order by
        o_totalprice desc,
        o_orderdate
) where rownum<=100;

Now the computation on the view is a single-table grouping & aggregation operation. The main query is a multi-table association and filtering.

SPL solution

This is a join between orders table having a foreign key association and its homo-dimension table lo. The lineitem table, as orders table’s sub table, is also ordered by l_orderkey, and view lo, which is computed from lineitem table, should be still ordered by l_orderkey. So, we can perform a fast merge join between orders and lo.

Rather than making customer table participate in the computation from the beginning, we can associate the foreign key table with orders table based on the result set.


A

1

=now()

2

>quantity=314

3

=file("lineitem.ctx").open().cursor@m(L_ORDERKEY,L_QUANTITY)

4

=A3.group@s(L_ORDERKEY;sum(L_QUANTITY):quantities).select(quantities>quantity)

5

=file("orders.ctx").open().new(A4,O_ORDERKEY,O_CUSTKEY,O_TOTALPRICE,O_ORDERDATE,quantities)

6

=A5.total(top(100;-O_TOTALPRICE,O_ORDERDATE))

7

=file("customer.ctx").open()

8

=A6.joinx@q(O_CUSTKEY,A7:C_CUSTKEY,C_NAME)

9

return interval@ms(A1,now())

A5 uses the order-based primary-sub table merge technique explained in the previous examples. A8 uses joinx@q to associate orders table and customer table in order.

Further optimization

1. Optimization method

In this example, we will use the dimension table primary key numberization method explained in Q2 – customer table’s C_CUSTKEY field and orders table’s O_CUSTKEY field have been converted in the previous examples.

2. Code for data conversion

Copy customer_13.ctx, orders_13.ctx and lineitem_17.ctx and rename them customer_18.ctx, orders_18.ctx and lineitem_18.ctx respectively.

3. Code after data conversion

First, we need to preload the dimension tables. Below is preloading code:


A

1

>env(customer, file("customer_18.ctx").open().import())

Before performing the query, we need to first run the preloading code to load the small dimension tables into memory.

Computing code:


A

1

=now()

2

>quantity=314

3

=file("lineitem_18.ctx").open().cursor@m(L_ORDERKEY,L_QUANTITY)

4

=A3.group@s(L_ORDERKEY;sum(L_QUANTITY):quantities).select(quantities>quantity)

5

=file("orders_18.ctx").open().new(A4,O_ORDERKEY,O_CUSTKEY,O_TOTALPRICE,O_ORDERDATE,quantities)

6

=A5.total(top(100;-O_TOTALPRICE,O_ORDERDATE))

7

O_ORDERDATE=date@o(O_ORDERDATE))=A6.derive(customer(O_CUSTKEY).C_NAME:C_NAME).run(

8

=interval@ms(A1,now())

The O_ORDERDATE field in orders_18.ctx has been converted to an integer type one in the previous example, so we need to use date@o to restore it to a date type one.

Using enterprise edition’s column-wise computation

1. Original data


A

1

=now()

2

>quantity=314

3

=file("lineitem.ctx").open().cursor@mv(L_ORDERKEY,L_QUANTITY)

4

=A3.group@s(L_ORDERKEY;sum(L_QUANTITY):quantities).select@v(quantities>quantity)

5

=file("orders.ctx").open().new(A4,O_ORDERKEY,O_CUSTKEY,O_TOTALPRICE,O_ORDERDATE,quantities)

6

=A5.total(top(100;-O_TOTALPRICE,O_ORDERDATE))

7

=file("customer.ctx").open()

8

=A6.joinx@q(O_CUSTKEY,A7:C_CUSTKEY,C_NAME)

9

=interval@ms(A1,now())

2. Optimized data

First, we need to preload the dimension tables. Below is preloading code:


A

1

>env(customer, file("customer_18.ctx").open().import@v())

Before performing the query, we need to first run the preloading code to load the small dimension tables into memory.

Computing code:


A

1

=now()

2

>quantity=314

3

=file("lineitem_18.ctx").open().cursor@mv(L_ORDERKEY,L_QUANTITY)

4

=A3.group@s(L_ORDERKEY;sum(L_QUANTITY):quantities).select@v(quantities>quantity)

5

=file("orders_18.ctx").open().new(A4,O_ORDERKEY,O_CUSTKEY,O_TOTALPRICE,O_ORDERDATE,quantities)

6

=A5.total(top(100;-O_TOTALPRICE,O_ORDERDATE))

7

=A6.derive(customer(O_CUSTKEY).C_NAME:C_NAME).run(O_ORDERDATE=date@o(O_ORDERDATE))

8

=interval@ms(A1,now())

Test result

Unit: Second


Regular

Column-wise

Before optimization

13.7

5.0

After optimization

13.7

4.9

The data conversion and preloading does not bring noticeable performance increase because the main computing time is spent in A3-A6, which the optimization methods do not work. The result set on which the optimization methods are employed has only 100 rows of data, which are too few to have a noticeable effect.