Performance optimization case course: TPCH-Q17

 

select
    sum(l_extendedprice) / 7.0 as avg_yearly
from
    lineitem,part
where
    p_partkey = l_partkey
    and p_brand = 'Brand#33'
    and p_container = 'LG DRUM'
    and l_quantity < (
        select
            0.2 * avg(l_quantity)
        From
            Lineitem
        Where
            l_partkey = p_partkey
);

Such an association with the primary table according to the equivalence condition in the subquery can always be converted to a JOIN operation, and the SQL statement can be rewritten as:

select
    sum(l_extendedprice) / 7.0 as avg_yearly
from
    lineitem,part,
    (select
        l_partkey lp,
        0.2*avg(l_quantity) lq
from
        lineitem
    group by
        l_partkey
    ) lpq
where
    p_partkey = l_partkey
    and p_brand = 'Brand#33'
    and p_container = 'LG DRUM'
    and l_partkey = lp
    and l_quantity < lq;

The subquery lpq is a grouping & aggregation operation, the results of which can be regarded as a table with lp as the primary key. Thus, the main query is equivalent to an aggregation operation on the result set of associating the fact table lineitem with the dimension tables lpq and part.

1. Data storage

There is no special requirement for tables lineitem and part, store them in order by primary key.

Continue to use lineitem.ctx from Q3 and part.ctx from Q2. Copy them to the main directory of this query.

2. General method

Calculation code:


A

B

1

=now()


2

=brand="Brand#33"

=container="LG DRUM"

3

=file("part.ctx").open().cursor@m(P_PARTKEY;P_BRAND==brand && P_CONTAINER==container).fetch().keys@im(P_PARTKEY)

4

=file("lineitem.ctx").open()

5

=A4.cursor@m(L_PARTKEY,L_QUANTITY;A3.find(L_PARTKEY))

6

=A5.groups@u(L_PARTKEY;avg(L_QUANTITY):avg).run(avg/=5).keys@im(L_PARTKEY)

7

=A4.cursor@m(L_PARTKEY,L_QUANTITY,L_EXTENDEDPRICE;L_PARTKEY:A6)

8

=A7.total(sum(if(L_QUANTITY<L_PARTKEY.avg,L_EXTENDEDPRICE,0)))/7.0

9

=interval@ms(A1,now())


A5 and A7 adopt the pre-cursor filtering technology, with the filter condition being search by primary key.

Test result:

Test items

Execution time (seconds)

General method

10

3. Data conversion

For this query, we need to use the dimension table primary key sequence-numberization method. Copy lineitem_4.ctx from Q8 and part_2.ctx from Q16 to the main directory of this query.

In the part_2.ctx of Q16, the enumeration string field p_brand has been converted to numbers. We need to convert the enumeration string field p_container to numbers.

Code for data conversion:


A

1

=file("part_16_3.ctx").open().cursor().fetch()

2

=A1.id(P_CONTAINER).sort()

3

=file("p_container.btx").export@b(A2)

4

=A1.run(P_CONTAINER=A2.pos@b(P_CONTAINER))

5

=file("part_17_3.ctx").create(#P_PARTKEY, P_NAME,P_MFGR, P_BRAND, P_TYPE, P_SIZE, P_CONTAINER, P_RETAILPRICE, P_COMMENT)

6

>A5.append(A4.cursor())

Calculation code:


A

B

1

=now()


2

=file("p_brand.btx").import@b().(_1)

3

=file("p_container.btx").import@b().(_1)

4

=brand=A2.pos@b("Brand#33")

=container=A3.pos@b("LG DRUM")

5

=file("part_17_3.ctx").open()

=A5.cursor().skip().(false)

6

=A5.cursor@m(P_PARTKEY;P_BRAND==A4 && P_CONTAINER==B4).fetch().(B5(P_PARTKEY)=true)

7

=file("lineitem_4.ctx").open()

8

=A7.cursor@m(L_PARTKEY,L_QUANTITY;B5(L_PARTKEY))

9

=A8.groups@u(L_PARTKEY;avg(L_QUANTITY):avg).run(B5(L_PARTKEY)=avg*0.2)

10

=A7.cursor@m(L_PARTKEY,L_QUANTITY,L_EXTENDEDPRICE;L_PARTKEY:B5:#)

11

=A10.total(sum(if(L_QUANTITY<L_PARTKEY,L_EXTENDEDPRICE,0)))/7.0

12

=interval@ms(A1,now())

A6 is an alignment sequence, and its value is true or false.

A9 updates the value of the alignment sequence A6 to false or avg*0.2. Thus, in A10, the L_PARTKEY of the record that meets the condition will be assigned a value of avg*0.2, allowing it to participate in subsequent calculations in A11.

Test result:

Test items

Execution time (seconds)

General method

10

Data conversion

9

After data conversion, the speed is not improved obviously. The reason is that the filter condition of pre-cursor filtering technology used in previous section is a search calculation by primary key, the performance of which is not much different from that of the alignment sequence used in this section.

4. Column-wise computing

Calculation code:


A

B

1

=now()


2

=file("p_brand.btx").import@b().(_1)

3

=file("p_container.btx").import@b().(_1)

4

=brand=A2.pos@b("Brand#33")

=container=A3.pos@b("LG DRUM")

5

=file("part_17_3.ctx").open()

=A5.cursor().skip().(false)

6

=A5.cursor@mv(P_PARTKEY;P_BRAND==A4 && P_CONTAINER==B4).fetch().(B5(P_PARTKEY)=true)

7

=file("lineitem_4.ctx").open()

8

=A7.cursor@mv(L_PARTKEY,L_QUANTITY;B5(L_PARTKEY))

9

=A8.groups@u(L_PARTKEY;avg(L_QUANTITY):avg).run(B5(L_PARTKEY)=avg*0.2)

10

=A7.cursor@m(L_PARTKEY,L_QUANTITY,L_EXTENDEDPRICE;B5(L_PARTKEY))

11

=A10.total(sum(if(L_QUANTITY<B5(L_PARTKEY),L_EXTENDEDPRICE,0)))/7.0

12

=interval@ms(A1,now())

Since the columnar cursor cannot use switch to assign values to fields, the filter condition in A10 is modified from L_PARTKEY:B5:# in the previous section as B5(L_PARTKEY), which can only implement filtering but not assignment. A11needstouseB5(L_PARTKEY) to calculate again.

Test result:

Test items

Execution time (seconds)

General method

10

Data conversion

9

Column-wise computing

6