Performance Optimization Exercises Using TPC-H – Q9

 

SQL code and analysis

Below is the SQL query statement:

select
    nation,
    o_year,
    sum(amount) as sum_profit
from
    (
        select
            n_name as nation,
            extract(year from o_orderdate) as o_year,
            l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount
        from
            part,
            supplier,
            lineitem,
            partsupp,
            orders,
            nation
        where
            s_suppkey = l_suppkey
            and ps_suppkey = l_suppkey
            and ps_partkey = l_partkey
            and p_partkey = l_partkey
            and o_orderkey = l_orderkey
            and s_nationkey = n_nationkey
            and p_name like '%chocolate%'
    ) profit
group by
    nation,
    o_year
order by
    nation,
    o_year desc;

This is a grouping & aggregation operation on the filtered result set of multi-table association, which involves a primary-sub table association between orders table and lineitem table and an association with the foreign key table having a two-field primary key.

SPL solution


A

1

=now()

2

>name="chocolate"

3

=file("nation.btx").import@b().keys@i(N_NATIONKEY)

4

=file("part.ctx").open().cursor@m(P_PARTKEY;pos(P_NAME, name)).fetch().keys@im(P_PARTKEY)

5

=file("supplier.ctx").open().cursor@m(S_SUPPKEY,S_NATIONKEY).fetch().keys@im(S_SUPPKEY)

6

=file("partsupp.ctx").open().cursor@m(PS_PARTKEY,PS_SUPPKEY,PS_SUPPLYCOST;A4.find(PS_PARTKEY)).fetch().keys@im(PS_PARTKEY,PS_SUPPKEY)

7

=file("lineitem.ctx").open().cursor@m(L_ORDERKEY,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT;A4.find(L_PARTKEY))

8

=file("orders.ctx").open().new@r(A7,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT,O_ORDERDATE)

9

=A8.join@i(L_PARTKEY:L_SUPPKEY,A6,PS_SUPPLYCOST).switch(L_SUPPKEY,A5)

10

=A9.groups(L_SUPPKEY.S_NATIONKEY:nationName,year(O_ORDERDATE):o_year;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)-PS_SUPPLYCOST*L_QUANTITY):profit)

11

=A10.run(nationName=A3.find(nationName).N_NAME).sort(nationName,-o_year)

12

=interval@ms(A1,now())

There is no filtering on orders table. The filtering action only happens on lineitem table. The number of data rows after filtering is much less than that in orders table, so we handle the primary-sub table association differently from the previous example. First, A7 creates a lineitem table cursor. Then A8 filters the primary table orders according to the cursor. This way, the number of data rows retrieved is considerably decreased and efficiency is increased.

A9 uses join@i method to stitch the foreign key table’s PS_SUPPLYCOSTA field to A8’s cursor. As the foreign key table has a two-field primary key, the method of matching and conversion at cursor creation becomes inapplicable.

Further optimization

1. Optimization method

In this example, we will use the date-integer conversion optimization method explained in Q1 – orders table’s O_ORDERDATE field is already converted in Q3, and the dimension table primary key numberization method explained in Q2 – nation table’s N_NATIONKEY field, part table’s P_PARTKEY, supplier table’s S_SUPPKEY and S_NATIONKEY fields, partsupp table’s PS_PARTKEY field and PS_SUPPKEY field, and lineitem table’s l_PARTKEY field and L_SUPPKEY field are already converted in the previous examples.

2. Code for data conversion

Copy nation_8.btx, part_8.ctx, supplier_8.ctx, partsupp_2.ctx, orders_8.ctx and lineitem_8.ctx, and rename them nation_9.btx, part_9.ctx, supplier_9.ctx, partsupp_9.ctx, orders_9.ctx and lineitem_9.ctx respectively.

3. Code after data conversion

First, we need to preload dimension tables. Below is preloading code:


A

1

>env(nation, file("nation_9.btx").import@b())

2

>env(part, file("part_9.ctx").open().import())

3

>env(supplier, file("supplier_9.ctx").open().import())

Before performing the query, we need to first run the preloading code to load small dimension tables into memory.

Computing code:


A

1

=now()

2

>name="chocolate"

3

=part.@m(pos(P_NAME, name))

4

=supplier.@m(S_NATIONKEY)

5

=file("partsupp_9.ctx").open().cursor@m(PS_PARTKEY,PS_SUPPKEY,PS_SUPPLYCOST;A3(PS_PARTKEY)).fetch().keys@im(PS_PARTKEY,PS_SUPPKEY)

6

=file("lineitem_9.ctx").open().cursor@m(L_ORDERKEY,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT;A3(L_PARTKEY))

7

=file("orders_9.ctx").open().new@r(A6,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT,O_ORDERDATE)

8

=A7.select(L_PARTKEY=A5.find(L_PARTKEY,L_SUPPKEY).PS_SUPPLYCOST).switch(L_SUPPKEY,A4:#)

9

=nation.len()+1

10

=A8.groups(year(O_ORDERDATE)*A9+L_SUPPKEY:gk;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)-L_PARTKEY*L_QUANTITY):profit)

11

=A10.new(nation(gk%A9).N_NAME:nationName,gk\A9:o_year,profit).sort(nationName,-o_year)

12

=interval@ms(A1,now())

A8 searches A5 for target records according to L_PARTKEY and L_SUPPKEY, and assigns L_PARTKEY as PS_SUPPLYCOST values – during which the current PS_SUPPLYCOST values is null if no corresponding A5 record is found and select() function will filter away the current record. This method has equal performance with the previously explained join@i method.

A10 uses the grouping key technique explained in Q1 to convert the grouping operation by two fields into one by a single field gk.

Using enterprise edition’s column-wise computation

1. Original data


A

1

=now()

2

>name="chocolate"

3

=file("nation.btx").import@b().keys@i(N_NATIONKEY)

4

=file("part.ctx").open().cursor@mv(P_PARTKEY;pos(P_NAME, name)).fetch().keys@im(P_PARTKEY)

5

=file("supplier.ctx").open().cursor@mv(S_SUPPKEY,S_NATIONKEY).fetch().keys@im(S_SUPPKEY)

6

=file("partsupp.ctx").open().cursor@mv(PS_PARTKEY,PS_SUPPKEY,PS_SUPPLYCOST;A4.find(PS_PARTKEY)).fetch().keys@im(PS_PARTKEY,PS_SUPPKEY)

7

=file("lineitem.ctx").open().cursor@mv(L_ORDERKEY,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT;A4.find(L_PARTKEY))

8

=file("orders.ctx").open().new@r(A7,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT,O_ORDERDATE)

9

=A8.join@i(L_PARTKEY:L_SUPPKEY,A6,PS_SUPPLYCOST;L_SUPPKEY,A5,S_NATIONKEY)

10

=A9.groups(S_NATIONKEY:nationName,year(O_ORDERDATE):o_year;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)-PS_SUPPLYCOST*L_QUANTITY):profit)

11

=A10.new(A3.find(nationName).N_NAME:nationName,o_year,profit).sort(nationName,-o_year)

12

=interval@ms(A1,now())

2. Optimized data

First, we need to preload dimension table. Below is preloading code:


A

1

>env(nation, file("nation_9.btx").import@bv())

2

>env(part, file("part_9.ctx").open().import@v())

3

>env(supplier, file("supplier_9.ctx").open().import@v())

Before performing the query, we need to first run the preloading code to load small dimension tables into memory.

Computing code:


A

1

=now()

2

>name="chocolate"

3

=part.@m(pos(P_NAME, name))

4

=file("partsupp_9.ctx").open().cursor@mv(PS_PARTKEY,PS_SUPPKEY,PS_SUPPLYCOST;A3(PS_PARTKEY)).fetch().keys@im(PS_PARTKEY,PS_SUPPKEY)

5

=file("lineitem_9.ctx").open().cursor@mv(L_ORDERKEY,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT;A3(L_PARTKEY))

6

=file("orders_9.ctx").open().new@r(A5,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT,O_ORDERDATE)

7

=A6.join@i(L_PARTKEY:L_SUPPKEY,A4,PS_SUPPLYCOST;L_SUPPKEY,supplier:#,S_NATIONKEY)

8

=nation.len()+1

9

=A7.groups(year(O_ORDERDATE)*A8+S_NATIONKEY:gk;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)-PS_SUPPLYCOST*L_QUANTITY):profit)

10

=A9.new(nation(gk%A8).N_NAME:nationName,gk\A8:o_year,profit).sort(nationName,-o_year)

11

=interval@ms(A1,now())

Test result

Unit: Second


Regular

Column-wise

Before optimization

33.5

11.3

After optimization

23.8

8.5