Performance Optimization Exercises Using TPC-H – Q21

 

SQL code and analysis

Below is the SQL query statement:

select * from (
    select
        s_name,
        count(*) as numwait
    from
        supplier,lineitem l1,orders,nation
    where
        s_suppkey = l1.l_suppkey
        and o_orderkey = l1.l_orderkey
        and o_orderstatus = 'F'
        and l1.l_receiptdate > l1.l_commitdate
        and exists (
            select
                *
            from
                lineitem l2
            where
                l2.l_orderkey = l1.l_orderkey
                and l2.l_suppkey <> l1.l_suppkey
        )
        and not exists (
            select
                *
            from
                lineitem l3
            where
                l3.l_orderkey = l1.l_orderkey
                and l3.l_suppkey <> l1.l_suppkey
                and l3.l_receiptdate > l3.l_commitdate
        )
        and s_nationkey = n_nationkey
        and n_name = 'CHINA'
    group by
        s_name
    order by
        numwait desc,
        s_name
) where rownum<=100;

The main query is a simple grouping & aggregation operation on the result set of primary-sub table association, with a complex filtering condition containing two exists clauses.

SPL solution

Both exists clauses perform computation on lineitem records under same l_orderkey.

We know that lineitem table is already ordered by l_orderkey. Besides, records of the sub table can be regarded as a set type field of the primary table. If we perform order-based grouping (but without aggregation) on result set of the association between orders table and lineitem table on orderkey, we get a series of small sets of lineitem records having same l_orderkey value. Then it becomes simple to compute the two exists conditions on each small set.


A

1

=now()

2

>name="CHINA"

3

=file("nation.btx").import@b().select@1(N_NAME==name).N_NATIONKEY

4

=file("supplier.ctx").open().cursor@m(S_SUPPKEY,S_NAME;S_NATIONKEY==A3).fetch().keys@im(S_SUPPKEY)

5

=file("orders.ctx").open().cursor@m(O_ORDERKEY;O_ORDERSTATUS=="F")

6

=file("lineitem.ctx").open().news@r(A5,L_ORDERKEY:ORDERKEY;A4.find(L_SUPPKEY))

7

=file("lineitem.ctx").open().news(A6:ORDERKEY,L_ORDERKEY,L_SUPPKEY,L_RECEIPTDATE,L_COMMITDATE)

8

=A7.group(L_ORDERKEY)

9

=A8.conj(if((t=~.select(L_RECEIPTDATE>L_COMMITDATE)) && (s=t.#2) && !t.pselect(#2!=s) && ~.pselect(#2!=s),t,null) )

10

=A9.switch@i(L_SUPPKEY,A4)

11

=A10.groups@u(L_SUPPKEY.S_NAME:s_name;count(1):numwait)

12

=A11.top(100;-numwait,s_name)

13

=interval@ms(A1,now())

A6 association orders table and lineitem table, filters association result and selects eligible ORDERKEY values. A7 associates with A6 to select the desired fields. A8 performs order-based grouping on A7 to divide it into sets of lineitem records having same l_orderkey values. A9 judges each of these small sets according to exists conditions and filters away the ineligible records. The rest of the code associates with the other foreign key table and performs the regular grouping operation on A10.

Further optimization

1. Optimization method

In this example, we will use date-integer conversion optimization method explained in Q1 – lineitem table’s L_COMMITDATE field and L_RECEIPTDATE field have been converted in the previous examples, string-integer conversion method explained in the same essay – here orders table’s O_ORDERSTATUS field needs to be converted, and dimension table primary key numberization method explained in Q2 – supplier table’s S_SUPPKEY field and lineitem table’s L_SUPPKEY field are already converted in the previous examples.

2. Code for data conversion

2.1 Conversion on nation table, supplier table and lineitem table

Copy nation_20.btx, supplier_20.ctx and lineitem_20.ctx, and rename them nation_21.btx, supplier_21.ctx and lineitem_21.ctx respectively.

2.2 Conversion on orders table


A

1

=file("orders.ctx").open().cursor(O_ORDERSTATUS)

2

=A1.id(O_ORDERSTATUS).sort()

3

=file("o_orderstatus.txt").export(A2)

4

=file("orders_18.ctx").open().cursor()

5

=A4.run(O_ORDERSTATUS=A2.pos@b(O_ORDERSTATUS))

6

=file("orders_21.ctx").create(#O_ORDERKEY,O_CUSTKEY,O_ORDERSTATUS,O_TOTALPRICE,O_ORDERDATE,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_COMMENT)

7

>A6.append(A5)

3. Code after data conversion

The dimension tables and the value list of string fields need to be preloaded. Below is preloading code:


A

1

>env(o_orderstatus,file("o_orderstatus.txt").import@si())

2

>env(nation, file("nation_21.btx").import@b())

3

>env(supplier, file("supplier_21.ctx").open().import())

Before performing the query, we need to first run the preloading code to load small dimension tables into memory.

Computing code:


A

1

=now()

2

>name="CHINA"

3

=o_orderstatus.pos@b("F")

4

=nation.select@1(N_NAME==name).N_NATIONKEY

5

=supplier.@m(if(S_NATIONKEY==A4,S_NAME,null))

6

=file("orders_21.ctx").open().cursor@m(O_ORDERKEY;O_ORDERSTATUS==A3)

7

=file("lineitem_21.ctx").open().news@r(A6,L_ORDERKEY:ORDERKEY;A5(L_SUPPKEY))

8

=file("lineitem_21.ctx").open().news(A7:ORDERKEY,L_ORDERKEY,L_SUPPKEY,L_COMMITDATE,L_RECEIPTDATE)

9

=A8.group(L_ORDERKEY)

10

=A9.conj(if((t=~.select(L_RECEIPTDATE>L_COMMITDATE)) && (s=t.#2) && !t.pselect(#2!=s) && ~.pselect(#2!=s),t.select(A5(L_SUPPKEY)),null) )

11

=A10.groups@u(L_SUPPKEY:s_name;count(1):numwait).run@m(s_name=A5(s_name))

12

=A11.top(100;-numwait,s_name)

13

=interval@ms(A1,now())

Using enterprise edition’s column-wise computation

1. Original data


A

1

=now()

2

>name="CHINA"

3

=file("nation.btx").import@b().select@1(N_NAME==name).N_NATIONKEY

4

=file("supplier.ctx").open().cursor@mv(S_SUPPKEY,S_NAME;S_NATIONKEY==A3).fetch().keys@im(S_SUPPKEY)

5

=file("orders.ctx").open().cursor@mv(O_ORDERKEY;O_ORDERSTATUS=="F")

6

=file("lineitem.ctx").open().news@r(A5,L_ORDERKEY:ORDERKEY,min(L_SUPPKEY):SK,count(1):cnt; A4.find(L_SUPPKEY) && L_RECEIPTDATE>L_COMMITDATE)

7

=file("lineitem.ctx").open().news@r(A6:ORDERKEY,L_ORDERKEY,SK,cnt,count(L_RECEIPTDATE>L_COMMITDATE && SK!=L_SUPPKEY):c1,count(SK!=L_SUPPKEY):c2).select@v(c1==0 && c2!=0)

8

=A7.groups@u(SK;sum(cnt):numwait).join(SK,A4,S_NAME).new(S_NAME:s_name,numwait)

9

=A8.top(100;-numwait,s_name)

10

=interval@ms(A1,now())

2. Optimized data

The dimension tables and the value list of string fields need to be preloaded. Below is preloading code:


A

1

>env(o_orderstatus,file("o_orderstatus.txt").import@si())

2

>env(nation, file("nation_21.btx").import@bv())

3

>env(supplier, file("supplier_21.ctx").open().import@v())

Before performing the query, we need to first run the preloading code to load small dimension tables into memory.

Computing code:


A

1

=now()

2

>name="CHINA"

3

=o_orderstatus.pos@b("F")

4

=nation.select@1(N_NAME==name).N_NATIONKEY

5

=supplier.(if(S_NATIONKEY==A4,S_NAME,null))

6

=file("orders_21.ctx").open().cursor@mv(O_ORDERKEY;O_ORDERSTATUS==A3)

7

=file("lineitem_21.ctx").open().news@r(A6,L_ORDERKEY:ORDERKEY,min(L_SUPPKEY):SK,count(1):cnt; A5(L_SUPPKEY) && L_RECEIPTDATE>L_COMMITDATE)

8

=file("lineitem_21.ctx").open().news@r(A7:ORDERKEY,L_ORDERKEY,SK,cnt,count(L_RECEIPTDATE>L_COMMITDATE && SK!=L_SUPPKEY):c1,count(SK!=L_SUPPKEY):c2).select@v(c1==0 && c2!=0)

9

=A8.groups@u(SK;sum(cnt):numwait).new(A5(SK):s_name,numwait)

10

=A9.top(100;-numwait,s_name)

11

=interval@ms(A1,now())

Test result

Unit: Second


Regular

Column-wise

Before optimization

19.9

14.0

After optimization

14.3

9.4