Performance Optimization Exercises Using TPC-H – Q10

 

SQL code and analysis

Below is the SQL query statement:

select * from (
    select
        c_custkey,c_name,
        sum(l_extendedprice * (1 - l_discount)) as revenue,
        c_acctbal,n_name,c_address,c_phone,c_comment
    from
        customer,orders,lineitem,nation
    where
        c_custkey = o_custkey
        and l_orderkey = o_orderkey
        and o_orderdate >= date '1993-05-01'
        and o_orderdate < date '1993-05-01' + interval '3' month
        and l_returnflag = 'R'
        and c_nationkey = n_nationkey
    group by
        c_custkey,
        c_name,
        c_acctbal,
        c_phone,
        n_name,
        c_address,
        c_comment
    order by
        revenue desc
) where rownum <=20;

This is a grouping & aggregation operation on the filtered result set of multi-table association (which involves a primary-sub table association between orders table and lineitem table). The target is to sort records by grouped aggregates and get the first 20 ones.

SPL solution

The working principle is similar to that in Q3. In the SQL statement, there are 7 fields in group by clause. In SPL, we just need to perform grouping by the first field.


A

1

=now()

2

1993-5-1

3

=elapse@m(A2,3)

4

=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A2 && O_ORDERDATE<A3)

5

=file("lineitem.ctx").open().news@r(A4,sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):dp,O_CUSTKEY;L_RETURNFLAG=="R")

6

=A5.groups@u(O_CUSTKEY:c_custkey;sum(dp):revenue)

7

=A6.top(-20;revenue)

8

=file("nation.btx").import@b().keys@i(N_NATIONKEY)

9

=file("customer.ctx").open()

10

=A7.joinx@q(c_custkey,A9:C_CUSTKEY,C_NAME,C_ADDRESS,C_NATIONKEY,C_PHONE,C_ACCTBAL,C_COMMENT)

11

=A10.switch(C_NATIONKEY,A8:N_NATIONKEY)

12

=A11.new(c_custkey:C_CUSTKEY,C_NAME,revenue,C_ACCTBAL,C_NATIONKEY.N_NAME,C_ADDRESS,C_PHONE,C_COMMENT)

13

=A12.sort@z(revenue)

14

=interval@ms(A1,now())

First, we perform grouping & aggregation on the result set of associating orders table and lineitem table; then based on the grouping & aggregation result set we perform the externally correlated foreign key table association. Since the target is to get 20 records, we can only perform the second association on the 20 records. There is no need to handle it before the grouping operation because computation will become more intensive that way.

After A7’s computation and as customer table is ordered by c_custkey field, we use A7.joinx@q to match records of customer table in order and get eligible ones quickly. Then we handle the other joins without the need to traverse customer table and with less data retrieved.

Further optimization

1. Optimization method

In this example, we will use the date-integer conversion optimization method explained in Q1 – orders table’s O_ORDERDATE field is already converted in Q3, the string-integer conversion method explained in Q1, where lineitem table’s L_RETRUNFLAG is already converted, and the dimension table primary key numberization method explained in Q2 – nation table’s N_NATIONKEY field, customer table’s C_CUSTKEY field and C_NATIONKEY field, orders table’s O_CUSTKEY field are already converted in the previous examples.

2. Code for data conversion

Copy nation_9.btx, customer_8.ctx, orders_9.ctx and lineitem_9.ctx, and rename them respectively as nation_10.btx, customer_10.ctx, orders_10.ctx andlineitem_10.ctx.

3. Code after data conversion

First, we need to preload dimension tables. Below is preloading code:


A

1

>env(nation, file("nation_10.btx").import@b())

2

>env(customer, file("customer_10.ctx").open().import())

3

>env(l_returnflag,file("l_returnflag.txt").import@si())

Before performing the query, we need to first run the preloading code to load small dimension tables into memory.

Computing code:


A

1

=now()

2

1993-5-1

3

=days@o(elapse@m(A2,3))

4

=days@o(A2)

5

=l_returnflag.pos@b("R")

6

=file("orders_10.ctx").open().cursor@m(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A4 && O_ORDERDATE<A3)

7

=file("lineitem_10.ctx").open().news@r(A6,sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):dp,O_CUSTKEY;L_RETURNFLAG==A5)

8

=A7.groups@u(O_CUSTKEY:c_custkey;sum(dp):revenue)

9

=A8.top(-20;revenue)

10

=A9.new((c=customer(c_custkey)).C_CUSTKEY,c.C_NAME,revenue,c.C_ACCTBAL,nation(c.C_NATIONKEY).N_NAME,c.C_ADDRESS,c.C_PHONE,c.C_COMMENT)

11

=A10.sort@z(revenue)

12

=interval@ms(A1,now())

Using enterprise edition’s column-wise computation

1. Original data


A

1

=now()

2

1993-5-1

3

=elapse@m(A2,3)

4

=file("orders.ctx").open().cursor@mv(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A2 && O_ORDERDATE<A3)

5

=file("lineitem.ctx").open().news@r(A4,sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):dp,O_CUSTKEY;L_RETURNFLAG=="R")

6

=A5.groups@u(O_CUSTKEY:c_custkey;sum(dp):revenue)

7

=A6.top(-20;revenue)

8

=file("nation.btx").import@bv().keys@i(N_NATIONKEY)

9

=file("customer.ctx").open()

10

=A7.joinx@q(c_custkey,A9:C_CUSTKEY,C_NAME,C_ADDRESS,C_NATIONKEY,C_PHONE,C_ACCTBAL,C_COMMENT)

11

=A10.join(C_NATIONKEY,A8,N_NAME)

12

=A11.new(c_custkey:C_CUSTKEY,C_NAME,revenue,C_ACCTBAL,N_NAME,C_ADDRESS,C_PHONE,C_COMMENT)

13

=A12.sort@z(revenue)

14

=interval@ms(A1,now())

2. Optimized data

First, we need to preload dimension tables. Below is preloading code:


A

1

>env(nation, file("nation_10.btx").import@bv())

2

>env(customer, file("customer_10.ctx").open().import@v())

3

>env(l_returnflag,file("l_returnflag.txt").import@si())

Before performing the query, we need to first run the preloading code to load small dimension tables into memory.

Computing code:


A

1

=now()

2

1993-5-1

3

=days@o(elapse@m(A2,3))

4

=days@o(A2)

5

=l_returnflag.pos@b("R")

6

=file("orders_10.ctx").open().cursor@mv(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A4 && O_ORDERDATE<A3)

7

=file("lineitem_10.ctx").open().news@r(A6,sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):dp,O_CUSTKEY;L_RETURNFLAG==A5)

8

=A7.groups@u(O_CUSTKEY:c_custkey;sum(dp):revenue)

9

=A8.top(-20;revenue)

10

=A9.new(customer(c_custkey):c,revenue).new(c.C_CUSTKEY,c.C_NAME,revenue,c.C_ACCTBAL,nation(c.C_NATIONKEY).N_NAME,c.C_ADDRESS,c.C_PHONE,c.C_COMMENT)

11

=A10.sort@z(revenue)

12

=interval@ms(A1,now())

Test result

Unit: Second


Regular

Column-wise

Before optimization

13.3

7.4

After optimization

12.2

6.2