Performance Optimization Exercises Using TPC-H – Q19

 

SQL code and analysis

Below is the SQL query statement:

select
    sum(l_extendedprice * (1 - l_discount)) as revenue
from
    lineitem,part
where
    (
        p_partkey = l_partkey
        and p_brand = 'Brand#32'
        and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
        and l_quantity >= 7 and l_quantity <= 7 + 10
        and p_size between 1 and 5
        and l_shipmode in ('AIR', 'AIR REG')
        and l_shipinstruct = 'DELIVER IN PERSON'
    )
    or
    (
        p_partkey = l_partkey
        and p_brand = 'Brand#23'
        and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
        and l_quantity >= 18 and l_quantity <= 18 + 10
        and p_size between 1 and 10
        and l_shipmode in ('AIR', 'AIR REG')
        and l_shipinstruct = 'DELIVER IN PERSON'
    )
    or
    (
        p_partkey = l_partkey
        and p_brand = 'Brand#45'
        and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
        and l_quantity >= 22 and l_quantity <= 22 + 10
        and p_size between 1 and 15
        and l_shipmode in ('AIR', 'AIR REG')
        and l_shipinstruct = 'DELIVER IN PERSON'
    );

This is an filtering and aggregate query based on two-table association. It has simple structure but complicated filtering condition.

SPL solution

In the filtering expression, there are many items that only involve the relatively small part table and that are complex to compute (there is IN operator). If we first compute these items on part table, there will be much less compute-intensive than performing computations on result set of the two-table association. This is because the association result set, which is as big as the lineitem table, is much larger.


A

1

=now()

2

>brand1="Brand#32"

3

>brand2="Brand#23"

4

>brand3="Brand#45"

5

>quantity1=7

6

>quantity2=18

7

>quantity3=22

8

=["SM CASE", "SM BOX", "SM PACK", "SM PKG"]

9

=["MED BAG", "MED BOX", "MED PKG", "MED PACK"]

10

=["LG CASE", "LG BOX", "LG PACK", "LG PKG"]

11

=["AIR","AIR REG"]

12

=file("part.ctx").open().cursor@m(P_PARTKEY,P_BRAND,P_SIZE,P_CONTAINER,0:FLAG;P_SIZE>=1)

13

=A12.run(FLAG=if(P_BRAND==brand1 && A8.contain(P_CONTAINER) && P_SIZE<=5:1, P_BRAND==brand2 && A9.contain(P_CONTAINER) && P_SIZE<=10:2, P_BRAND==brand3 && A10.contain(P_CONTAINER) && P_SIZE<=15:3; 0)).select(FLAG>0).fetch@o().keys@im(P_PARTKEY)

14

=file("lineitem.ctx").open().cursor@m(L_PARTKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT;L_PARTKEY:A13,L_SHIPINSTRUCT=="DELIVER IN PERSON" && A11.contain(L_SHIPMODE) && L_QUANTITY>= quantity1 && L_QUANTITY<=quantity3+10)

15

=A14.groups(;sum(if(case(L_PARTKEY.FLAG,1:L_QUANTITY<=quantity1+10,2:L_QUANTITY>=quantity2 && L_QUANTITY<=quantity2+10;L_QUANTITY>=quantity3),L_EXTENDEDPRICE*(1-L_DISCOUNT),0)))

16

=interval@ms(A1,now())

A12 adds a FLAG field to part table for computing the complex conditional items involving part table. When a record meets item 1, item 2 and item 3 in the SQL OR condition, assign corresponding values 1, 2, 3 to FLAG; and assign 0 to the field when neither item is met. Then we filter records according to whether or not FLAG is greater than 0. A15 selects the quantity value according to FLAG value to preform filtering on L_QUANTITY. The FLAG-related complex formula just needs to compute the condition-related number for each row of part table.

Further optimization

1. Optimization method

In this example, we will use string-integer conversion method explained in Q1 – part table’s P_BRAND field and P_CONTAINER field have been converted in Q16 and Q17 respectively, lineitem table’s L_SHIPMODE field has been converted in Q12 and its L_SHIPINSTRUCT field needs to be converted here, and dimension table primary key numberization method explained in Q2 – part table’s P_PARTKEY field and lineitem table’s L_PARTKEY field have been converted in the previous examples.

2. Code for data conversion

2.1 Conversion on part table

Copy part_17.ctx and rename it part_19.ctx.

2.2 Conversion on lineitem table


A

1

=file("lineitem.ctx").open().cursor(L_SHIPINSTRUCT)

2

=A1.id(L_SHIPINSTRUCT).sort()

3

=file("l_shipinstruct.txt").export(A2)

4

=file("lineitem_18.ctx").open().cursor()

5

=A4.run(L_SHIPINSTRUCT=A2.pos@b(L_SHIPINSTRUCT))

6

=file("lineitem_19.ctx").create@p(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE,L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT)

7

>A6.append(A5)

3. Code after data conversion

The dimension tables and the value list of string fields need to be preloaded. Below is preloading code:


A

1

>env(p_brand,file("p_brand.txt").import@si())

2

>env(p_container,file("p_container.txt").import@si())

3

>env(l_shipinstruct,file("l_shipinstruct.txt").import@si())

4

>env(l_shipmode,file("l_shipmode.txt").import@si())

5

>env(part, file("part_19.ctx").open().import())

Before performing the query, we need to first run the preloading code to load the small dimension tables into memory.

Computing code:


A

1

=now()

2

>brand1=p_brand.pos@b("Brand#32")

3

>brand2=p_brand.pos@b("Brand#23")

4

>brand3=p_brand.pos@b("Brand#45")

5

>quantity1=7

6

>quantity2=18

7

>quantity3=22

8

=p_container.(["SM CASE", "SM BOX", "SM PACK", "SM PKG"].contain(~))

9

=p_container.(["MED BAG", "MED BOX", "MED PKG", "MED PACK"].contain(~))

10

=p_container.(["LG CASE", "LG BOX", "LG PACK", "LG PKG"].contain(~))

11

=l_shipmode.(["AIR","AIR REG"].contain(~))

12

=l_shipinstruct.pos@b("DELIVER IN PERSON")

13

=part.@m(if(P_SIZE<1:null, P_BRAND==brand1 && A8(P_CONTAINER) && P_SIZE<=5:1, P_BRAND==brand2 && A9(P_CONTAINER) && P_SIZE<=10:2, P_BRAND==brand3 && A10(P_CONTAINER) && P_SIZE<=15:3; null))

14

=file("lineitem_19.ctx").open().cursor@m(L_PARTKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT;L_PARTKEY:A13:#,L_SHIPINSTRUCT==A12 && A11(L_SHIPMODE) && L_QUANTITY>= quantity1 && L_QUANTITY<=quantity3+10)

15

=A14.groups(;sum(if(case(L_PARTKEY,1:L_QUANTITY<=quantity1+10,2:L_QUANTITY>=quantity2 && L_QUANTITY<=quantity2+10,3:L_QUANTITY>=quantity3),L_EXTENDEDPRICE*(1-L_DISCOUNT),0)))

16

=interval@ms(A1,now())

Using enterprise edition’s column-wise computation

1. Original data


A

1

=now()

2

>brand1="Brand#32"

3

>brand2="Brand#23"

4

>brand3="Brand#45"

5

>quantity1=7

6

>quantity2=18

7

>quantity3=22

8

=["SM CASE", "SM BOX", "SM PACK", "SM PKG"].i()

9

=["MED BAG", "MED BOX", "MED PKG", "MED PACK"].i()

10

=["LG CASE", "LG BOX", "LG PACK", "LG PKG"].i()

11

=["AIR","AIR REG"].i()

12

=file("part.ctx").open().cursor@mv(P_PARTKEY,P_BRAND,P_SIZE,P_CONTAINER;P_SIZE>=1)

13

=A12.derive@o(if(P_BRAND==brand1 && A8.contain(P_CONTAINER) && P_SIZE<=5:1, P_BRAND==brand2 && A9.contain(P_CONTAINER) && P_SIZE<=10:2, P_BRAND==brand3 && A10.contain(P_CONTAINER) && P_SIZE<=15:3; 0):FLAG).select@v(FLAG>0).fetch@o().keys@im(P_PARTKEY)

14

=file("lineitem.ctx").open().cursor@mv(L_PARTKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT;L_PARTKEY:A13,L_QUANTITY>= quantity1 && L_QUANTITY<=quantity3+10 && L_SHIPINSTRUCT=="DELIVER IN PERSON" && A11.contain(L_SHIPMODE))

15

=A14.groups(;sum(if(case(L_PARTKEY.FLAG,1:L_QUANTITY<=quantity1+10,2:L_QUANTITY>=quantity2 && L_QUANTITY<=quantity2+10;L_QUANTITY>=quantity3),L_EXTENDEDPRICE*(1-L_DISCOUNT),0)))

16

=interval@ms(A1,now())

2. Optimized data

The dimension tables and the value list of string fields need to be preloaded. Below is preloading code:


A

1

>env(p_brand,file("p_brand.txt").import@si())

2

>env(p_container,file("p_container.txt").import@si())

3

>env(l_shipinstruct,file("l_shipinstruct.txt").import@si())

4

>env(l_shipmode,file("l_shipmode.txt").import@si())

5

>env(part, file("part_19.ctx").open().import@v())

Before performing the query, we need to first run the preloading code to load the small dimension tables into memory.

Computing code:


A

1

=now()

2

>brand1=p_brand.pos@b("Brand#32")

3

>brand2=p_brand.pos@b("Brand#23")

4

>brand3=p_brand.pos@b("Brand#45")

5

>quantity1=7

6

>quantity2=18

7

>quantity3=22

8

=p_container.(["SM CASE", "SM BOX", "SM PACK", "SM PKG"].contain(~))

9

=p_container.(["MED BAG", "MED BOX", "MED PKG", "MED PACK"].contain(~))

10

=p_container.(["LG CASE", "LG BOX", "LG PACK", "LG PKG"].contain(~))

11

=l_shipmode.(["AIR","AIR REG"].contain(~))

12

=l_shipinstruct.pos@b("DELIVER IN PERSON")

13

=part.@m(if(P_SIZE<1:null, P_BRAND==brand1 && A8(P_CONTAINER) && P_SIZE<=5:1, P_BRAND==brand2 && A9(P_CONTAINER) && P_SIZE<=10:2, P_BRAND==brand3 && A10(P_CONTAINER) && P_SIZE<=15:3; null))

14

=file("lineitem_19.ctx").open().cursor@mv(L_PARTKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT;L_PARTKEY:A13:#,L_SHIPINSTRUCT==A12 && A11(L_SHIPMODE) && L_QUANTITY>= quantity1 && L_QUANTITY<=quantity3+10)

15

=A14.groups(;sum(if(case(L_PARTKEY,1:L_QUANTITY<=quantity1+10,2:L_QUANTITY>=quantity2 && L_QUANTITY<=quantity2+10,3:L_QUANTITY>=quantity3),L_EXTENDEDPRICE*(1-L_DISCOUNT),0)))

16

=interval@ms(A1,now())

Test result

Unit: Second


Regular

Column-wise

Before optimization

8.2

6.2

After optimization

7.0

4.8