Performance Optimization Exercises Using TPC-H – Q20

 

SQL code and analysis

Below is the SQL query statement:

select
    s_name,s_address
from
    supplier,nation
where
    s_suppkey in (
        select
            ps_suppkey
        from
            partsupp
        where
            ps_partkey in (
                select
                    p_partkey
                from
                    part
                where
                    p_name like 'bisque%'
            )
            and ps_availqty > (
                select
                    0.5 * sum(l_quantity)
                from
                    lineitem
                where
                    l_partkey = ps_partkey
                    and l_suppkey = ps_suppkey
                    and l_shipdate >= date '1995-01-01'
                    and l_shipdate < date '1995-01-01' + interval '1' year
            )
    )
    and s_nationkey = n_nationkey
    and n_name = 'CHINA'
order by
    s_name;

The main query is simple, but the nested filtering condition is complex.

SPL solution

The query appears complex, but it becomes clear when divided into multiple steps:

1. Filter nation table according to the condition;

2. Filter part table according to the condition;

3. Take result of step 1 as the foreign key table to match and filter supplier table;

4. Take result of steps 2 and 3 as foreign key tables to match and filter partsupp table;

5. Take result of step 4 as the foreign key table and associate it with lineitem table, select PS_AVAILQTY field, and perform grouping and aggregation by L_PARTKEY field and L_SUPPKEY field. As mentioned previously, a subquery containing an association with the primary table according to the equivalence condition can be always rewritten as a statement of grouping operation and association with the primary table. Based on the grouping & aggregation result set, we then select the eligible L_SUPPKEY values.

6. Take result of step 5 as the foreign key table to match and filter supplier table.

The whole process is generating intermediate foreign key tables and use them to match and filter other tables.


A

1

=now()

2

1995-1-1

3

=elapse@y(A2,1)

4

>partname="bisque"

5

>nationname="CHINA"

6

=file("nation.btx").import@b().select@1(N_NAME==nationname).N_NATIONKEY

7

=file("part.ctx").open().cursor@m(P_PARTKEY;pos@h(P_NAME,partname)).fetch().keys@im(P_PARTKEY)

8

=file("supplier.ctx").open().cursor@m(S_SUPPKEY,S_NAME,S_ADDRESS;S_NATIONKEY==A6).fetch().keys@im(S_SUPPKEY)

9

=file("partsupp.ctx").open().cursor@m(PS_PARTKEY,PS_SUPPKEY,PS_AVAILQTY;A7.find(PS_PARTKEY),A8.find(PS_SUPPKEY)).fetch().keys@im(PS_PARTKEY,PS_SUPPKEY)

10

=file("lineitem.ctx").open().cursor@m(L_PARTKEY,L_SUPPKEY,L_QUANTITY;L_SHIPDATE>=A2 && L_SHIPDATE<A3)

11

=A10.join@i(L_PARTKEY:L_SUPPKEY,A9:PS_PARTKEY:PS_SUPPKEY,PS_AVAILQTY)

12

=A11.groups@u(L_PARTKEY,L_SUPPKEY;PS_AVAILQTY,sum(L_QUANTITY):quantity)

13

=A12.select(PS_AVAILQTY*2>quantity).id(L_SUPPKEY)

14

=A8.join@im(S_SUPPKEY,A13:~)

15

=A14.new(S_NAME,S_ADDRESS).sort@0(S_NAME)

16

=interval@ms(A1,now())

A14 uses join@im, where @i option deletes non-matching records and @m option enables using order-based merge to speed up computation when both S_SUPPKEY and A13 are ordered.

Further optimization

1. Optimization method

In this example, we will use the date-integer conversion optimization method explained in Q1, where linetime table’s L_SHIPDATE has been converted, and dimension table primary key numberization optimization method explained in Q2 – part table’s P_PARTKEY field, supplier table’s S_SUPPKEY field, partsupp table’s PS_PARTKEY field and PS_SUPPKEY field, and lineitem table’s L_PARTKEY field and L_SUPPKEY field have all been converted in the previous examples.

2. Code for data conversion

Copy nation_11.btx, part_19.ctx, supplier_16.ctx, partsupp_16.ctx and lineitem_19.ctx, and rename them nation_20.btx, part_20.ctx, supplier_20.ctx, partsupp_20.ctx and lineitem_20.ctx respectively.

3. Code after data conversion

First, we need to preload the dimension tables. Below is preloading code:


A

1

>env(nation, file("nation_20.btx").import@b())

2

>env(part, file("part_20.ctx").open().import())

3

>env(supplier, file("supplier_20.ctx").open().import())

Before performing the query, we need to first run the preloading code to load the small dimension tables into memory.

Computing code:


A

1

=now()

2

1995-1-1

3

=days@o(elapse@y(A2,1))

4

=days@o(A2)

5

>partname="bisque"

6

>nationname="CHINA"

7

=nation.select@1(N_NAME==nationname).N_NATIONKEY

8

=part.@m(pos@h(P_NAME,partname))

9

=supplier.@m(S_NATIONKEY==A7)

10

=file("partsupp_20.ctx").open().cursor@m(PS_PARTKEY,PS_SUPPKEY,PS_AVAILQTY;A8(PS_PARTKEY) && A9(PS_SUPPKEY)).fetch().keys@im(PS_PARTKEY,PS_SUPPKEY)

11

=file("lineitem_20.ctx").open().cursor@m(L_PARTKEY,L_SUPPKEY,L_QUANTITY;L_SHIPDATE>=A4 && L_SHIPDATE<A3)

12

=A11.join@i(L_PARTKEY:L_SUPPKEY,A10:PS_PARTKEY:PS_SUPPKEY,PS_AVAILQTY)

13

=A12.groups@u(L_PARTKEY,L_SUPPKEY;PS_AVAILQTY,sum(L_QUANTITY):quantity)

14

=A13.select(PS_AVAILQTY*2>quantity).run(A9(L_SUPPKEY)=null)

15

=supplier(A9.pselect@a(~==null))

16

=A15.new(S_NAME,S_ADDRESS).sort@0(S_NAME)

17

=interval@ms(A1,now())

A14 first select eligible records and then sets values of A9’s members corresponding to each record’s L_SUPPKEY as null. A15 first gets row numbers of all A9’s members having null values and then selects corresponding records from supplier table according to these row numbers.

Using enterprise edition’s column-wise computation

1. Original data


A

1

=now()

2

1995-1-1

3

=elapse@y(A2,1)

4

>partname="bisque"

5

>nationname="CHINA"

6

=file("nation.btx").import@b().select@1(N_NAME==nationname).N_NATIONKEY

7

=file("part.ctx").open().cursor@mv(P_PARTKEY;pos@h(P_NAME,partname)).fetch().keys@im(P_PARTKEY)

8

=file("supplier.ctx").open().cursor@mv(S_SUPPKEY,S_NAME,S_ADDRESS;S_NATIONKEY==A6).fetch().keys@im(S_SUPPKEY)

9

=file("partsupp.ctx").open().cursor@mv(PS_PARTKEY,PS_SUPPKEY,PS_AVAILQTY;A7.find(PS_PARTKEY),A8.find(PS_SUPPKEY)).fetch().keys@im(PS_PARTKEY,PS_SUPPKEY)

10

=file("lineitem.ctx").open().cursor@mv(L_PARTKEY,L_SUPPKEY,L_QUANTITY;L_SHIPDATE>=A2 && L_SHIPDATE<A3)

11

=A10.join@i(L_PARTKEY:L_SUPPKEY,A9:PS_PARTKEY:PS_SUPPKEY,PS_AVAILQTY)

12

=A11.groups@u(L_PARTKEY,L_SUPPKEY;PS_AVAILQTY,sum(L_QUANTITY):quantity)

13

=A12.select@mv(PS_AVAILQTY*2>quantity).id(L_SUPPKEY)

14

=A8.join@im(S_SUPPKEY,A13:~)

15

=A14.new@m(S_NAME,S_ADDRESS).sort@o(S_NAME)

16

=interval@ms(A1,now())

2. Optimized data

First, we need to preload the dimension tables. Below is preloading code:


A

1

>env(nation, file("nation_20.btx").import@bv())

2

>env(part, file("part_20.ctx").open().import@v())

3

>env(supplier, file("supplier_20.ctx").open().import@v())

Before performing the query, we need to first run the preloading code to load the small dimension tables into memory.

Computing code:


A

1

=now()

2

1995-1-1

3

=days@o(elapse@y(A2,1))

4

=days@o(A2)

5

>partname="bisque"

6

>nationname="CHINA"

7

=nation.select@1(N_NAME==nationname).N_NATIONKEY

8

=part.(pos@h(P_NAME,partname))

9

=supplier.(S_NATIONKEY==A7)

10

=file("partsupp_20.ctx").open().cursor@mv(PS_PARTKEY,PS_SUPPKEY,PS_AVAILQTY;A8(PS_PARTKEY) && A9(PS_SUPPKEY)).fetch().keys@im(PS_PARTKEY,PS_SUPPKEY)

11

=file("lineitem_20.ctx").open().cursor@mv(L_PARTKEY,L_SUPPKEY,L_QUANTITY;L_SHIPDATE>=A4 && L_SHIPDATE<A3)

12

=A11.join@i(L_PARTKEY:L_SUPPKEY,A10:PS_PARTKEY:PS_SUPPKEY,PS_AVAILQTY)

13

=A12.groups@u(L_PARTKEY,L_SUPPKEY;PS_AVAILQTY,sum(L_QUANTITY):quantity)

14

=A13.select@mv(PS_AVAILQTY*2>quantity).run@m(A9(L_SUPPKEY)=null)

15

=supplier(A9.pselect@a(~==null))

16

=A15.new(S_NAME,S_ADDRESS).sort@0(S_NAME)

17

=interval@ms(A1,now())

Test result

Unit: Second


Regular

Column-wise

Before optimization

10.6

5.9

After optimization

8.8

3.9