Performance Optimization Exercises Using TPC-H – Q8

 

SQL code and analysis

Below is the SQL query statement:

select
    o_year,
    sum(case
        when nation = 'CHINA' then volume
        else 0
    end) / sum(volume) as mkt_share
from
    (
        select
            extract(year from o_orderdate) as o_year,
            l_extendedprice * (1 - l_discount) as volume,
            n2.n_name as nation
        from
            part,
            supplier,
            lineitem,
            orders,
            customer,
            nation n1,
            nation n2,
            region
        where
            p_partkey = l_partkey
            and s_suppkey = l_suppkey
            and l_orderkey = o_orderkey
            and o_custkey = c_custkey
            and c_nationkey = n1.n_nationkey
            and n1.n_regionkey = r_regionkey
            and r_name = 'ASIA'
            and s_nationkey = n2.n_nationkey
            and o_orderdate between date '1995-01-01' and date '1996-12-31'
            and p_type = 'STANDARD POLISHED BRASS'
    ) all_nations
group by
    o_year
order by
    o_year;

This is a grouping & aggregation operation on the filtered result set of multi-table association that contains a primary-sub table association between orders table and lineitem table.

SPL solution

The working principle is similar to that in Q5.


A

1

=now()

2

1995-01-01

3

1996-12-31

4

>nation="CHINA"

5

>name="ASIA"

6

>type="STANDARD POLISHED BRASS"

7

=file("region.btx").import@b().select@1(R_NAME==name).R_REGIONKEY

8

=file("nation.btx").import@b()

9

=A8.select@1(N_NAME==nation).N_NATIONKEY

10

=A8.select(N_REGIONKEY==A7).derive@o().keys(N_NATIONKEY)

11

=file("supplier.ctx").open().cursor@m(S_SUPPKEY;S_NATIONKEY==A9).fetch().keys@im(S_SUPPKEY)

12

=file("part.ctx").open().cursor@m(P_PARTKEY;P_TYPE==type).fetch().keys@im(P_PARTKEY)

13

=file("customer.ctx").open().cursor@m(C_CUSTKEY;A10.find(C_NATIONKEY)).fetch().keys@im(C_CUSTKEY)

14

=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_ORDERDATE;O_ORDERDATE>=A2 && O_ORDERDATE <=A3 && A13.find(O_CUSTKEY))

15

=file("lineitem.ctx").open().news(A14,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_ORDERDATE;A12.find(L_PARTKEY))

16

=A15.run(L_EXTENDEDPRICE*=(1-L_DISCOUNT))

17

=A16.groups(year(O_ORDERDATE):o_year;sum(if(A11.find(L_SUPPKEY),L_EXTENDEDPRICE,0)):s1,sum(L_EXTENDEDPRICE):s2)

18

=A17.new(o_year,s1/s2:mkt_share)

19

=interval@ms(A1,now())

Further optimization

1. Optimization method

In this example, we will use the date-integer conversion optimization method explained in Q1 – the orders table’s OR_ORDERDATE field is already converted in Q3, the string-integer conversion method explained in Q1 – the part table’s P_TYPE field is already converted in Q2, and the dimension table primary key numberization method explained in Q2 – region table’s R_REGIONKEY field, nation table’s N_NATIONKEY and N_REGIONKEY fields, part table’s P_PARTKEY field, supplier table’s S_SUPPKEY and S_NATIONKEY fields, customer table’s C_CUSTKEY and C_NATIONKEY fields, orders table’s O_CUSTKEY field, and lineitem table’s L_SUPPKEY field are already converted in the previous examples. Here we only need to convert lineitem table’s L_PARTKEY field.

2. Code for data conversion

2.1 Conversion on region table, nation table, part table, supplier table, customer table and orders table

Copy region_5.btx, nation_7.btx, part_2.ctx, supplier_7.ctx, customer_7.ctx and orders_7.ctx, and rename them region_8.btx, nation_8.btx, part_8.ctx, supplier_8.ctx, customer_8.ctx, and orders_8.ctx respectively.

2.2 Conversion on lineitem table


A

1

=file("part.ctx").open().import(P_PARTKEY).keys@im(P_PARTKEY)

2

=file("lineitem_7.ctx").open().cursor()

3

=A2.run(L_PARTKEY=A1.pfind(L_PARTKEY))

4

=file("lineitem_8.ctx").create@p(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE,L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT)

5

>A4.append(A3)

3. Code after data conversion

The dimension tables and the value list of string fields need to be preloaded. Below is preloading code:


A

1

>env(p_type,file("p_type.txt").import@si())

2

>env(region, file("region_8.btx").import@b())

3

>env(nation, file("nation_8.btx").import@b())

4

>env(supplier, file("supplier_8.ctx").open().import())

5

>env(customer, file("customer_8.ctx").open().import())

6

>env(part, file("part_8.ctx").open().import())

Before performing the query, we need to first run the preloading code to load small dimension tables into memory.

Computing code:


A

1

=now()

2

=days@o(date("1995-01-01"))

3

=days@o(date("1996-12-31"))

4

>nationName="CHINA"

5

>name="ASIA"

6

>type=p_type.pos@b("STANDARD POLISHED BRASS")

7

=region.pselect(R_NAME==name)

8

=nation.pselect(N_NAME==nationName)

9

=nation.(N_REGIONKEY==A7)

10

=supplier.@m(S_NATIONKEY==A8)

11

=part.@m(P_TYPE==type)

12

=customer.@m(A9(C_NATIONKEY))

13

=file("orders_8.ctx").open().cursor@m(O_ORDERKEY,O_ORDERDATE;O_ORDERDATE>=A2 && O_ORDERDATE <=A3 && A12(O_CUSTKEY))

14

=file("lineitem_8.ctx").open().news(A13,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_ORDERDATE;A11(L_PARTKEY))

15

=A14.run(L_EXTENDEDPRICE*=(1-L_DISCOUNT))

16

=A15.groups(year(O_ORDERDATE):o_year;sum(if(A10(L_SUPPKEY),L_EXTENDEDPRICE,0)):s1,sum(L_EXTENDEDPRICE):s2)

17

=A16.new(o_year,s1/s2:mkt_share)

18

=interval@ms(A1,now())

Using enterprise edition’s column-wise computation

1. Original data


A

1

=now()

2

1995-01-01

3

1996-12-31

4

>nation="CHINA"

5

>name="ASIA"

6

>type="STANDARD POLISHED BRASS"

7

=file("region.btx").import@b().select@1(R_NAME==name).R_REGIONKEY

8

=file("nation.btx").import@b()

9

=A8.select@1(N_NAME==nation).N_NATIONKEY

10

=A8.select(N_REGIONKEY==A7).derive@o().keys(N_NATIONKEY)

11

=file("supplier.ctx").open().cursor@mv(S_SUPPKEY;S_NATIONKEY==A9).fetch().keys@im(S_SUPPKEY)

12

=file("part.ctx").open().cursor@mv(P_PARTKEY;P_TYPE==type).fetch().keys@im(P_PARTKEY)

13

=file("customer.ctx").open().cursor@mv(C_CUSTKEY;A10.find(C_NATIONKEY)).fetch().keys@im(C_CUSTKEY)

14

=file("orders.ctx").open().cursor@mv(O_ORDERKEY,O_ORDERDATE;O_ORDERDATE>=A2 && O_ORDERDATE <=A3 && A13.find(O_CUSTKEY))

15

=file("lineitem.ctx").open().cursor@v(L_ORDERKEY,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT;A12.find(L_PARTKEY);A14)

16

=A15.derive@o(L_EXTENDEDPRICE*(1-L_DISCOUNT):dp)

17

=A16.joinx@im(L_ORDERKEY,A14:O_ORDERKEY,O_ORDERDATE)

18

=A16.groups(year(O_ORDERDATE):o_year;sum(if(A11.find(L_SUPPKEY),dp,0)):s1,sum(dp):s2)

19

=A18.new(o_year,s1/s2:mkt_share)

20

=interval@ms(A1,now())

We do not use news()function to perform the primary-sub table association between orders table and lineitem table. Instead, joinx@im is used to perform an order-based merge join after cursors are synchronously segmented. Of course, the news() function is applicable, but joinx() function is slightly faster in our testing.

A16 uses derive@o to add a new column dp, so there is no need to compute dp again in A18.

2. Optimized data

The dimension tables and the value list of string fields need to be preloaded. Below is preloading code:


A

1

>env(p_type,file("p_type.txt").import@si())

2

>env(region, file("region_8.btx").import@bv())

3

>env(nation, file("nation_8.btx").import@bv())

4

>env(supplier, file("supplier_8.ctx").open().import@v())

5

>env(customer, file("customer_8.ctx").open().import@v())

6

>env(part, file("part_8.ctx").open().import@v())

Before performing the query, we need to first run the preloading code to load small dimension tables into memory.

Computing code:


A

1

=now()

2

=days@o(date("1995-01-01"))

3

=days@o(date("1996-12-31"))

4

>nationName="CHINA"

5

>name="ASIA"

6

>type=p_type.pos@b("STANDARD POLISHED BRASS")

7

=region.pselect(R_NAME==name)

8

=nation.pselect(N_NAME==nationName)

9

=nation.(N_REGIONKEY==A7)

10

=supplier.(S_NATIONKEY==A8)

11

=part.(P_TYPE==type)

12

=customer.(A9(C_NATIONKEY))

13

=file("orders_8.ctx").open().cursor@mv(O_ORDERKEY,O_ORDERDATE;O_ORDERDATE>=A2 && O_ORDERDATE<=A3 && A12(O_CUSTKEY))

14

=file("lineitem_8.ctx").open().cursor@mv(L_ORDERKEY,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT;A11(L_PARTKEY);A13)

15

=A14.joinx@im(L_ORDERKEY,A13:O_ORDERKEY,O_ORDERDATE)

16

=A15.derive@o(L_EXTENDEDPRICE*(1-L_DISCOUNT):dp)

17

=A16.groups(year(O_ORDERDATE):o_year;sum(if(A10(L_SUPPKEY),dp,0)):s1,sum(dp):s2)

18

=A17.new(o_year,s1/s2:mkt_share)

19

=interval@ms(A1,now())

Test result

Unit: Second


Regular

Column-wise

Before optimization

20.5

8.0

After optimization

13.3

5.1