Performance optimization case course: TPCH-Q4

 

select
    o_orderpriority,
    count(*) as order_count
from
    orders
where
    o_orderdate >= date '1995-10-01'
and o_orderdate < date '1995-10-01' + interval '3' month
and exists (
        select * from lineitem
        where l_orderkey = o_orderkey and l_commitdate < l_receiptdate
    )
group by
     o_orderpriority
order by
     o_orderpriority;

This SQL statement has an exists-subquery, which is used to find out records in the sub table lineitem that satisfy the condition l_commitdate < l_receiptdate. The exists clause can be rewritten as join. The rewritten code is:

select
    o_orderpriority,
    count(*) as order_count
from (
    select
        o_orderkey,o_orderpriority
    from orders join (
        select l_orderkey
        from lineitem
        where l_commitdate < l_receiptdate
        group by l_orderkey
    ) on l_orderkey = o_orderkey
    where
        o_orderdate >= date '1995-10-01'
        and o_orderdate < date '1995-10-01' + interval '3' month
)
group by
    o_orderpriority
order by
    o_orderpriority;

It can be seen that the rewritten code is to filter, group, and aggregate the sub-table first, and then perform a one-to-one join with primary table.

1. Data storage

Store the big tables orders and lineitem in order by the primary key orderkey. Since the sub-table lineitem remains ordered to orderkey after being grouped by orderkey and aggregate, we can merge it with the primary table orders in order.

In this way, we can directly use orders.ctx and lineitem.ctx from Q3.

Copy these tables to the main directory of this query.

2. Group and aggregate before joining

Calculation idea: group the sub-table lineitem in order first, and then merge it with the primary table orders in order.

Calculation code:


A

B

1

=now()


2

1995-10-01

3

3

=elapse@m(A2,B2)

4

=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_ORDERPRIORITY;O_ORDERDATE>=A2 && O_ORDERDATE<A3)

5

=file("lineitem.ctx").open().cursor(L_ORDERKEY;L_COMMITDATE<L_RECEIPTDATE;A4)

6

=A5.group(L_ORDERKEY)

7

=A4.joinx@im(O_ORDERKEY,A6:L_ORDERKEY)

8

=A7.groups(O_ORDERPRIORITY;count(1):order_count)

9

=interval@ms(A1,now())

A4, A5: Use the pre-cursor filtering mechanism.

A6: Group lineitem in order by the primary key orderkey. The result remains ordered to orderkey.

A7: Merge the grouped lineitem with the primary table orders in order by primary key.

Test result:

Test items

Execution time (seconds)

Aggregate before merging

13

3. Direct aggregation when joining

Use the news function mentioned in the previous article. When joining the primary table and the sub table, the sub table can be directly aggregated, which eliminates the process of copying primary table fields and other calculations.

Calculation code:


A

B

1

=now()


2

1995-10-01

3

3

=elapse@m(A2,B2)


4

=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_ORDERPRIORITY;O_ORDERDATE>=A2 && O_ORDERDATE<A3)

5

=file("lineitem.ctx").open().news@r(A4,O_ORDERPRIORITY;L_COMMITDATE<L_RECEIPTDATE)

6

=A5.groups(O_ORDERPRIORITY;count(1):order_count)

7

=interval@ms(A1,now())

In A5, adding an open @r to the news function means that the join is based on the primary table orders. If @r is not added, the join is based on latter table by default. When the join is based on primary table, the sub table needs to be aggregated before joining.

Test result:

Test items

Execution time (seconds)

Aggregate before merging

13

Direct aggregation when joining

7

4. Data conversion

Utilize the conversion methods mentioned in previous articles: convert enumeration string field to numbers, convert date to integers.

Convert orders:


A

1

=file("orders.ctx").open().cursor(O_ORDERPRIORITY)

2

=A1.id(O_ORDERPRIORITY).sort()

3

=file("o_orderpriority.btx").export@b(A2)

4

=file("orders.ctx").open().cursor()

5

=A4.run(O_ORDERPRIORITY=A2.pos@b(O_ORDERPRIORITY),O_ORDERDATE=days@o(O_ORDERDATE))

6

=file("orders_4.ctx").create(#O_ORDERKEY,O_CUSTKEY,O_ORDERSTATUS,O_TOTALPRICE,O_ORDERDATE,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_COMMENT)

7

>A6.append(A5)

Convert lineitem:


A

1

=file("lineitem.ctx").open().cursor()

2

=A1.run(L_COMMITDATE=days@o(L_COMMITDATE),L_RECEIPTDATE=days@o(L_RECEIPTDATE))

3

=file("lineitem_4.ctx").create@py(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS,L_SHIPDATE,L_COMMITDATE,L_RECEIPTDATE,L_SHIPINSTRUCT,L_SHIPMODE, L_COMMENT)

4

>A3.append(A2)

Calculation code:


A

B

1

=now()


2

1995-10-01

3

3

>o_orderpriority=file("o_orderpriority.btx").import@b().(_1)

4

=days@o(elapse@m(A2,B2))

=days@o(A2)

5

=file("orders_4.ctx").open().cursor@m(O_ORDERKEY,O_ORDERPRIORITY;O_ORDERDATE>=B4 && O_ORDERDATE<A4)

6

=file("lineitem_4.ctx").open().news@r(A5,O_ORDERPRIORITY;L_COMMITDATE<L_RECEIPTDATE)

7

=A6.groups(O_ORDERPRIORITY;count(1):order_count)

8

=A7.run(O_ORDERPRIORITY=o_orderpriority(O_ORDERPRIORITY))

9

=interval@ms(A1,now())

Test result:

Test items

Execution time (seconds)

Aggregate before merging

13

Direct aggregation when joining

7

Data conversion

6

5. Column-wise computing

Calculation code:


A

B

1

=now()


2

1995-10-01

3

3

>o_orderpriority=file("o_orderpriority.btx").import@b().(_1)

4

=days@o(elapse@m(A2,B2))

=days@o(A2)

5

=file("orders_4.ctx").open().cursor@mv(O_ORDERKEY,O_ORDERPRIORITY;O_ORDERDATE>=B4 && O_ORDERDATE<A4)

6

=file("lineitem_4.ctx").open().news@r(A5,O_ORDERPRIORITY;L_COMMITDATE<L_RECEIPTDATE)

7

=A6.groups(O_ORDERPRIORITY;count(1):order_count)

8

=A7.new(o_orderpriority(O_ORDERPRIORITY):O_ORDERPRIORITY,order_count)

9

=interval@ms(A1,now())

A10: To convert the integerized O_ORDERPRIORITY back to string, the common method is to use the run function to reset the value of O_ORDERPRIORITY. However, since the data type of column cannot be changed in column-wise computing, the new function is used here.

Test result:

Test items

Execution time (seconds)

Aggregate before merging

13

Direct aggregation when joining

7

Data conversion

6

Column-wise computing

2