Performance optimization case course: TPCH-Q10

 

select * from (
    select
        c_custkey,c_name,
        sum(l_extendedprice * (1 - l_discount)) as revenue,
        c_acctbal,n_name,c_address,c_phone,c_comment
    from
        customer,orders,lineitem,nation
    where
        c_custkey = o_custkey
        and l_orderkey = o_orderkey
        and o_orderdate >= date '1993-05-01'
        and o_orderdate < date '1993-05-01' + interval '3' month
        and l_returnflag = 'R'
        and c_nationkey = n_nationkey
    group by
        c_custkey,
        c_name,
        c_acctbal,
        c_phone,
        n_name,
        c_address,
        c_comment
    order by
        revenue desc
) where rownum <=20;

This is a grouping & aggregation operation on the filtered result set of multi-table association which involves a primary-sub table association between orders table and lineitem table. The target is to sort records by grouped aggregates and get the first 20 ones.

1. Data storage

Store the big tables orders and lineitem in order by the primary key orderkey, which makes it possible to merge them in order when joining. For other tables, there is no special sorting requirement, they are all stored in order by primary key.

So, we can continue to use orders.ctx, lineitem.ctx and customer.ctx from Q3.

For the nation table, we can directly use nation.btx from Q2.

Copy these tables to the main directory of this query.

2. General method


A

1

=now()

2

1993-5-1

3

=elapse@m(A2,3)

4

=file("nation.btx").import@b().keys@i(N_NATIONKEY)

5

=file("customer.ctx").open().cursor@m(C_CUSTKEY,C_NAME,C_ADDRESS,C_NATIONKEY,C_PHONE,C_ACCTBAL,C_COMMENT;C_NATIONKEY:A4).fetch().keys@i(C_CUSTKEY)

6

=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A2 && O_ORDERDATE<A3,O_CUSTKEY:A5)

7

=file("lineitem.ctx").open().news@r(A6,sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):dp,O_CUSTKEY;L_RETURNFLAG=="R")

8

=A7.groups@u(O_CUSTKEY:c_custkey;sum(dp):revenue)

9

=A8.top(-20;revenue)

10

=A9.new(c_custkey.C_CUSTKEY,c_custkey.C_NAME,revenue,c_custkey.C_ACCTBAL,c_custkey.C_NATIONKEY.N_NAME,c_custkey.C_ADDRESS,c_custkey.C_PHONE,c_custkey.C_COMMENT)

11

=A10.sort@z(revenue)

12

=interval@ms(A1,now())

Here we use the optimization methods mentioned in the previous articles such as pre-cursor filtering, order-based primary-sub table merge, and foreign key fieldattributization.

Test result:

Test items

Execution time (seconds)

General method

17

3. Big dimension table search

We found that this operation can be changed into two steps. The first step is to associate orders with lineitem and perform grouping and aggregation operation on custkey. The second step is to associate the result set with the dimension table customer. In SQL, it can be roughly rewritten as:

select c_custkey,c_name,revenue,c_acctbal,n_name,c_address,c_phone,c_comment
from (
    select o_custkey, revenue
    from (
    	select o_custkey,
           sum(l_extendedprice * (1 - l_discount)) as revenue
    	from
           orders,lineitem
    	where
           l_orderkey = o_orderkey
        	      and o_orderdate >= date '1993-05-01'
        	      and o_orderdate < date '1993-05-01' + interval '3' month
        	      and l_returnflag = 'R'
           group by o_custkey
           order by revenue desc
    ) orders_lineitem
    where rownum <=20
) orders_lineitem20,customer,nation
where
    c_custkey = o_custkey
    and c_nationkey = n_nationkey
order by revenue desc;

Because the target is to get 20 records, and the subquery orders_lineitem20 is also to return 20 records, we just need to perform subsequent association between the 20 records and the customer table. In contrast, the original method needs to associate every record of the subquery orders_lineitem with customer table.

This is a scenario where the small fact table orders_lineitem20 joins with the big dimension table customer. We can use the big dimension table search method, that is, convert the join operation into a batch search on customer’s primary key. Since the customer table is ordered by the primary key c_custkey, the search calculation can be completed quickly.

Calculation code:


A

1

=now()

2

1993-5-1

3

=elapse@m(A2,3)

4

=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A2 && O_ORDERDATE<A3)

5

=file("lineitem.ctx").open().news@r(A4,sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):dp,O_CUSTKEY;L_RETURNFLAG=="R")

6

=A5.groups@u(O_CUSTKEY:c_custkey;sum(dp):revenue)

7

=A6.top(-20;revenue)

8

=file("nation.btx").import@b().keys@i(N_NATIONKEY)

9

=file("customer.ctx").open()

10

=A7.joinx@q(c_custkey,A9:C_CUSTKEY,C_NAME,C_ADDRESS,C_NATIONKEY,C_PHONE,C_ACCTBAL,C_COMMENT)

11

=A10.switch(C_NATIONKEY,A8:N_NATIONKEY)

12

=A11.new(c_custkey:C_CUSTKEY,C_NAME,revenue,C_ACCTBAL,C_NATIONKEY.N_NAME,C_ADDRESS,C_PHONE,C_COMMENT)

13

=A12.sort@z(revenue)

14

=interval@ms(A1,now())

After A7’s computation, we use joinx@q to match records of customer table in order and get eligible ones quickly. This eliminates theneedto load customer into memory or join the orders table containing a large amount of data with the customer table, thus reducing the amount of data retrieval and calculation.

Test result:

Test items

Execution time (seconds)

General method

17

Big dimension table search

7

4. Data conversion

Utilize the conversion methods mentioned in previous articles: convert enumeration string field to number, convert primary key of dimension table to sequence number, convert date to integer.

For the nation table, we can directly use nation_2.btx converted in Q2.

For the customer table, we can use customer_3.ctx from Q5.

For the orders table, we can use orders_5.ctx from Q3.

Copy these tables to the main directory of this query.

Although Q1 has converted L_RETURNFLAG in lineitem to number, lineitem.ctx does not have primary key, resulting in a failure to meet order-based merge requirement. Therefore, it needs to convert enumeration field to number based on the lineitem.ctx in Q3.

Code for data conversion:


A

1

=file("lineitem.ctx").open()

2

=A1.cursor(L_RETURNFLAG).id(L_RETURNFLAG).sort()

3

=file("l_returnflag.btx").export@b(A2)

4

=A1.cursor().run(L_RETURNFLAG =A2.pos@b(L_RETURNFLAG))

5

=file("lineitem4.ctx").create(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS,L_SHIPDATE,L_COMMITDATE,L_RECEIPTDATE,L_SHIPINSTRUCT,L_SHIPMODE,L_COMMENT)

6

>A5.append(A4)

Calculation code:


A

1

=now()

2

1993-5-1

3

=days@o(elapse@m(A2,3))

4

=days@o(A2)

5

=file("nation_2.btx").import@b()

6

=file("l_returnflag.btx").import@b().(_1=="R")

7

=file("orders_5.ctx").open().cursor@m(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A4 && O_ORDERDATE<A3)

8

=file("lineitem4.ctx").open().news@r(A7,sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):dp,O_CUSTKEY;A6(L_RETURNFLAG))

9

=A8.groups@u(O_CUSTKEY:c_custkey;sum(dp):revenue)

10

=A9.top(-20;revenue)

11

=file("customer.ctx").open()

12

=A10.joinx@q(c_custkey,A11:C_CUSTKEY,C_NAME,C_ADDRESS,C_NATIONKEY,C_PHONE,C_ACCTBAL,C_COMMENT)

13

=A12.new(c_custkey:C_CUSTKEY,C_NAME,revenue,C_ACCTBAL,A5(C_NATIONKEY).N_NAME,C_ADDRESS,C_PHONE,C_COMMENT)

14

=A13.sort@z(revenue)

15

=interval@ms(A1,now())

A3 and A4 are to convert date to number, and A6 is aligned sequence.

The nation table in A13 is to sequence-numberize the foreign key field.

Test result:

Test items

Execution time (seconds)

General method

17

Big dimension table search

7

Data conversion

5

5. Column-wise computing


A

1

=now()

2

1993-5-1

3

=days@o(elapse@m(A2,3))

4

=days@o(A2)

5

=file("nation_2.btx").import@b()

6

=file("l_returnflag.btx").import@b().(_1=="R")

7

=file("orders_5.ctx").open().cursor@mv(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A4 && O_ORDERDATE<A3)

8

=file("lineitem4.ctx").open().news@r(A7,sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):dp,O_CUSTKEY;A6(L_RETURNFLAG))

9

=A8.groups@u(O_CUSTKEY:c_custkey;sum(dp):revenue)

10

=A9.top(-20;revenue)

11

=file("customer.ctx").open()

12

=A10.joinx@q(c_custkey,A11:C_CUSTKEY,C_NAME,C_ADDRESS,C_NATIONKEY,C_PHONE,C_ACCTBAL,C_COMMENT)

13

=A12.new(c_custkey:C_CUSTKEY,C_NAME,revenue,C_ACCTBAL,A5(C_NATIONKEY).N_NAME,C_ADDRESS,C_PHONE,C_COMMENT)

14

=A13.sort@z(revenue)

15

=interval@ms(A1,now())

Test result:

Test items

Execution time (seconds)

General method

17

Big dimension table search

7

Data conversion

5

Column-wise computing

3