Performance Optimization Exercises Using TPC-H – Q7

 

SQL code and analysis

Below is the SQL query statement:

select
    supp_nation,
    cust_nation,
    l_year,
    sum(volume) as revenue
from
    (
        select
            n1.n_name as supp_nation,
            n2.n_name as cust_nation,
            extract(year from l_shipdate) as l_year,
            l_extendedprice * (1 - l_discount) as volume
        from
            supplier,
            lineitem,
            orders,
            customer,
            nation n1,
            nation n2
        where
            s_suppkey = l_suppkey
            and o_orderkey = l_orderkey
            and c_custkey = o_custkey
            and s_nationkey = n1.n_nationkey
            and c_nationkey = n2.n_nationkey
            and (
                (n1.n_name = 'CHINA' and n2.n_name = 'RUSSIA')
                or (n1.n_name = 'RUSSIA' and n2.n_name = 'CHINA')
            )
            and l_shipdate between date '1995-01-01' and date '1996-12-31'
    ) shipping
group by
    supp_nation,
    cust_nation,
    l_year
order by
    supp_nation,
    cust_nation,
    l_year;

This is a grouping & aggregation operation on the filtered result set of multi-table association that contains a primary-sub table association between orders table and lineitem table.

SPL solution

The working principle is similar to that in Q5.


A

1

=now()

2

1995-01-01

3

1996-12-31

4

>name1="CHINA"

5

>name2="RUSSIA"

6

=file("nation.btx").import@b().keys@i(N_NATIONKEY)

7

>n1=A6.select@1(N_NAME==name1).N_NATIONKEY

8

>n2=A6.select@1(N_NAME==name2).N_NATIONKEY

9

=file("supplier.ctx").open().cursor@m(S_SUPPKEY,S_NATIONKEY;S_NATIONKEY==n1 || S_NATIONKEY==n2).fetch().keys@im(S_SUPPKEY)

10

=file("customer.ctx").open().cursor@m(C_CUSTKEY,C_NATIONKEY;C_NATIONKEY==n1 || C_NATIONKEY==n2).fetch().keys@im(C_CUSTKEY)

11

=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_CUSTKEY;O_CUSTKEY:A10)

12

=file("lineitem.ctx").open().news(A11,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,L_SHIPDATE,O_CUSTKEY;L_SHIPDATE>=A2 && L_SHIPDATE <=A3,L_SUPPKEY:A9)

13

=A12.select(O_CUSTKEY.C_NATIONKEY!=L_SUPPKEY.S_NATIONKEY)

14

=A13.groups(L_SUPPKEY.S_NATIONKEY:supp_nation,O_CUSTKEY.C_NATIONKEY:cust_nation,year(L_SHIPDATE):l_year; sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)): volume)

15

=A14.run(supp_nation=A6.find(supp_nation).N_NAME,cust_nation=A6.find(cust_nation).N_NAME)

16

=interval@ms(A1,now())

Note that A6’s nation table data is used respectively in A7 and A8 for performing filtering according to foreign key matching. This is different from SQL alias syntax.

Further optimization

1. Optimization method

In this example, we will use the date-integer conversion optimization method explained in Q1, where lineitem table’s L_SHIPDATE is already converted, and the dimension table primary key numberization –nation table’s N_NATIONKEY and N_REGIONKEY fields, supplier table’s S_SUPPKEY and S_NATIONKEY fields, customer table’s C_CUSTKEY field and C_NATIONKEY field, orders table’s O_CUSTKEY, and lineitem table’s L_SUPPKEY field are already converted.

2. Code for data conversion

Copy nation_5.btx, supplier_5.ctx, customer_5.ctx, orders_5.ctx and lineitem_5.ctx, and rename them nation_7.btx, supplier_7.ctx, customer_7.ctx, orders_7.ctx and lineitem_7.ctx.

3. Code after data conversion

The dimension tables and the value list of string fields need to be preloaded. Below is preloading code:


A

1

>env(nation, file("nation_7.btx").import@b())

2

>env(customer, file("customer_7.ctx").open().import())

3

>env(supplier, file("supplier_7.ctx").open().import())

Before performing the query, we need to first run the preloading code to load small dimension tables into memory.

Computing code:


A

1

=now()

2

=days@o(date("1995-01-01"))

3

=days@o(date("1996-12-31"))

4

[CHINA,RUSSIA]

5

=nation.(A4.pos(N_NAME))

6

=supplier.@m(if(A5(S_NATIONKEY),S_NATIONKEY,null))

7

=customer.@m(if(A5(C_NATIONKEY),C_NATIONKEY,null))

8

=file("orders_7.ctx").open().cursor@m(O_ORDERKEY,O_CUSTKEY;O_CUSTKEY:A7:#)

9

=file("lineitem_7.ctx").open().news(A8,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,L_SHIPDATE,O_CUSTKEY;L_SHIPDATE>=A2 && L_SHIPDATE <=A3,L_SUPPKEY:A6:#)

10

=A9.select(O_CUSTKEY!=L_SUPPKEY).run(O_CUSTKEY=A5(O_CUSTKEY),L_SUPPKEY=A5(L_SUPPKEY))

11

=A10.groups@u(year(L_SHIPDATE)*9+L_SUPPKEY*3+O_CUSTKEY:gk; sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)): volume)

12

=A11.new(A4(gk%9\3):supp_nation,A4(gk%3):cust_nation,gk\9:l_year,volume).sort(supp_nation,cust_nation,l_year)

13

=interval@ms(A1,now())

A10 replaces values of O_CUSTKEY and L_SUPPKEY with corresponding values of A5’s alignment sequence, which are either 1 or 2. A11 uses the grouping key technique explained in Q1 to convert the grouping operation by three fields into one by a single filed.

After date values are converted to integers, we can still use the year() function to get the year part, as A11 does.

Same optimization methods are used here as those in Q5.

Using enterprise edition’s column-wise computation

1. Original data


A

1

=now()

2

1995-01-01

3

1996-12-31

4

>name1="CHINA"

5

>name2="RUSSIA"

6

=file("nation.btx").import@b().keys@i(N_NATIONKEY)

7

>n1=A6.select@1(N_NAME==name1).N_NATIONKEY

8

>n2=A6.select@1(N_NAME==name2).N_NATIONKEY

9

=file("supplier.ctx").open().cursor@mv(S_SUPPKEY,S_NATIONKEY;S_NATIONKEY==n1 || S_NATIONKEY==n2).fetch().keys@im(S_SUPPKEY)

10

=file("customer.ctx").open().cursor@mv(C_CUSTKEY,C_NATIONKEY;C_NATIONKEY==n1 || C_NATIONKEY==n2).fetch().keys@im(C_CUSTKEY)

11

=file("orders.ctx").open().cursor@mv(O_ORDERKEY,O_CUSTKEY;O_CUSTKEY:A10)

12

=file("lineitem.ctx").open().news(A11,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,L_SHIPDATE,O_CUSTKEY;L_SHIPDATE>=A2 && L_SHIPDATE<=A3,L_SUPPKEY:A9)

13

=A12.select@v(O_CUSTKEY.C_NATIONKEY!=L_SUPPKEY.S_NATIONKEY)

14

=A13.groups(L_SUPPKEY.S_NATIONKEY:supp_nation,O_CUSTKEY.C_NATIONKEY:cust_nation,year(L_SHIPDATE):l_year; sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)): volume)

15

=A14.new(A6.find(supp_nation).N_NAME:supp_nation,A6.find(cust_nation).N_NAME:cust_nation,l_year,volume)

16

=interval@ms(A1,now())

2. Optimized data

The dimension tables and the value list of string fields need to be preloaded. Below is preloading code:


A

1

>env(nation, file("nation_7.btx").import@bv())

2

>env(customer, file("customer_7.ctx").open().import@v())

3

>env(supplier, file("supplier_7.ctx").open().import@v())

Before performing the query, we need to first run the preloading code to load small dimension tables into memory.

Computing code:


A

1

=now()

2

=days@o(date("1995-01-01"))

3

=days@o(date("1996-12-31"))

4

[CHINA,RUSSIA]

5

=nation.(A4.pos(N_NAME))

6

=supplier.(A5(S_NATIONKEY))

7

=customer.(A5(C_NATIONKEY))

8

=file("orders_7.ctx").open().cursor@mv(O_ORDERKEY,O_CUSTKEY;A7(O_CUSTKEY))

9

=file("lineitem_7.ctx").open().news(A8,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,L_SHIPDATE,O_CUSTKEY;L_SHIPDATE>=A2 && L_SHIPDATE<=A3 && A6(L_SUPPKEY))

10

=A9.derive@o(A5(supplier(L_SUPPKEY).S_NATIONKEY):s,A5(customer(O_CUSTKEY).C_NATIONKEY):c).select@v(s!=c)

11

=A10.groups@u(year(L_SHIPDATE)*9+c*3+s:gk; sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):volume)

12

=A11.new(A4(gk%3):supp_nation,A4(gk%9\3):cust_nation,gk\9:l_year,volume).sort(supp_nation,cust_nation,l_year)

13

=interval@ms(A1,now())

Test result

Unit: Second


Regular

Column-wise

Before optimization

16.0

10.1

After optimization

11.0

6.6