Performance optimization case course: TPCH-Q18

 

select * from (
    select
        c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice,
        sum(l_quantity)
    from
        customer, orders, lineitem
    where
        o_orderkey in (
            select
                l_orderkey
            from
                lineitem
            group by
                l_orderkey
            having
                sum(l_quantity) > 314
        )
        and c_custkey = o_custkey
        and o_orderkey = l_orderkey
    group by
        c_name,
        c_custkey,
        o_orderkey,
        o_orderdate,
        o_totalprice
    order by
        o_totalprice desc,
        o_orderdate
) where rownum<=100;

Let’s examine the SQL statement. If we name the following subquery

select
    l_orderkey, sum(l_quantity) lq
from
    lineitem
group by
    l_orderkey

view lo, the body of the original query statement can be written as:

select * from (
    select
        c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, lq
    from
        customer, orders, lo
    where
        c_custkey = o_custkey
        and o_orderkey = l_orderkey
        and lq > 314
    order by
        o_totalprice desc,
        o_orderdate
) where rownum<=100;

Now the computation on the view lo is a grouping & aggregation operation, and the result can be regarded as a table with l_orderkey as its primary key. Thus, the main query is equivalent to associating the fact tables orders and lo, then associating the dimension table customer, and finally performing conditional filtering and sorting.

1. Data storage

Store orders and lineitem in order by primary key, which can implementorderedmerging. Store customer in order byprimary key.

Continue to use orders.ctx, lineitem.ctx and customer.ctx from Q3. Copy them to the main directory of this query.

2. General method

Since both orders and lo are ordered by the primary key orderkey, they are equivalent to being homo dimension table to each other, allowing for order-based merge join.

The result set of taking the top 100 after ordered merging and aggregation is equivalent to a small fact table, which will be joined with customer. We need to use the big dimension table search method mentioned in previous articles.

Calculation code:


A

1

=now()

2

>quantity=314

3

=file("lineitem.ctx").open().cursor@m(L_ORDERKEY,L_QUANTITY)

4

=A3.group@s(L_ORDERKEY;sum(L_QUANTITY):quantities).select(quantities>quantity)

5

=file("orders.ctx").open().new(A4,O_ORDERKEY,O_CUSTKEY,O_TOTALPRICE,O_ORDERDATE,quantities)

6

=A5.total(top(100;-O_TOTALPRICE,O_ORDERDATE))

7

=file("customer.ctx").open()

8

=A6.joinx@q(O_CUSTKEY,A7:C_CUSTKEY,C_NAME)

9

=interval@ms(A1,now())

A8 uses the joinx@q function explained in previous articles to deduplicate O_CUSTKEY in the result of A6 and perform a batch search in customer table.

Test result:

Test items

Execution time (seconds)

General method

9

It should be noted that the dimension table customer will not appear in the ordered merge calculation of fact tables before A7; the join after A7 is a search operation on big dimension table. Throughout this process, the dimension table primary key sequence-numberization method cannot be used, so there is no need to improve performance through data conversion in this query.

3. Column-wise computing

Perform column-wise computing based on ordered merge of fact tables and big dimension table search.

Calculation code:


A

1

=now()

2

>quantity=314

3

=file("lineitem.ctx").open().cursor@mv(L_ORDERKEY,L_QUANTITY)

4

=A3.group@s(L_ORDERKEY;sum(L_QUANTITY):quantities).select@v(quantities>quantity)

5

=file("orders.ctx").open().new(A4,O_ORDERKEY,O_CUSTKEY,O_TOTALPRICE,O_ORDERDATE,quantities)

6

=A5.total(top(100;-O_TOTALPRICE,O_ORDERDATE))

7

=file("customer.ctx").open()

8

=A6.joinx@q(O_CUSTKEY,A7:C_CUSTKEY,C_NAME)

9

=interval@ms(A1,now())

A3 gets the columnar cursor.

A4 uses the group function to do column-wise computing, and then select@v also performs column-wise computing. The result will return a columnar cursor.

Test result:

Test items

Execution time (seconds)

General method

9

Column-wise computing

2