"select sum(l_extendedprice) / 7.0 as avg_yearly from lineitem,part where p_partkey = l_partkey a .."

chenhu RaqForum 81 No.
1 Reply • 150 View • 2 Months ago

Performance optimization case course: TPCH-Q17

SQL(10)

select
    sum(l_extendedprice) / 7.0 as avg_yearly
from
    lineitem,part
where
    p_partkey = l_partkey
    and p_brand = 'Brand#33'
    and p_container = 'LG DRUM'
    and l_quantity < (
        select
            0.2 * avg(l_quantity)
        From
            Lineitem
        Where
            l_partkey = p_partkey
);

Such an association with the primary table according to the equivalence condition in the subquery can always be converted to a JOIN operation, and the SQL statement can be rewritten as:

select
    sum(l_extendedprice) / 7.0 as avg_yearly
from
    lineitem,part,
    (select
        l_partkey lp,
        0.2*avg(l_quantity) lq
from
        lineitem
    group by
        l_partkey
    ) lpq
where
    p_partkey = l_partkey
    and p_brand = 'Brand#33'
    and p_container = 'LG DRUM'
    and l_partkey = lp
    and l_quantity < lq;

The subquery lpq is a grouping & aggregation operation, the results of which can be regarded as a table with lp as the primary key. Thus, the main query is equivalent to an aggregation operation on the result set of associating the fact table lineitem with the dimension tables lpq and part.

1. Data storage

There is no special requirement for tables lineitem and part, store them in order by primary key.

Continue to use lineitem.ctx from Q3 and part.ctx from Q2. Copy them to the main directory of this query.

2. General method

Calculation code:

	A	B
1	=now()
2	=brand="Brand#33"	=container="LG DRUM"
3	=file("part.ctx").open().cursor@m(P_PARTKEY;P_BRAND==brand && P_CONTAINER==container).fetch().keys@im(P_PARTKEY)
4	=file("lineitem.ctx").open()
5	=A4.cursor@m(L_PARTKEY,L_QUANTITY;A3.find(L_PARTKEY))
6	=A5.groups@u(L_PARTKEY;avg(L_QUANTITY):avg).run(avg/=5).keys@im(L_PARTKEY)
7	=A4.cursor@m(L_PARTKEY,L_QUANTITY,L_EXTENDEDPRICE;L_PARTKEY:A6)
8	=A7.total(sum(if(L_QUANTITY<L_PARTKEY.avg,L_EXTENDEDPRICE,0)))/7.0
9	=interval@ms(A1,now())

A5 and A7 adopt the pre-cursor filtering technology, with the filter condition being search by primary key.

Test result:

Test items	Execution time (seconds)
General method	10

3. Data conversion

For this query, we need to use the dimension table primary key sequence-numberization method. Copy lineitem_4.ctx from Q8 and part_2.ctx from Q16 to the main directory of this query.

In the part_2.ctx of Q16, the enumeration string field p_brand has been converted to numbers. We need to convert the enumeration string field p_container to numbers.

Code for data conversion:

	A
1	=file("part_16_3.ctx").open().cursor().fetch()
2	=A1.id(P_CONTAINER).sort()
3	=file("p_container.btx").export@b(A2)
4	=A1.run(P_CONTAINER=A2.pos@b(P_CONTAINER))
5	=file("part_17_3.ctx").create(#P_PARTKEY, P_NAME,P_MFGR, P_BRAND, P_TYPE, P_SIZE, P_CONTAINER, P_RETAILPRICE, P_COMMENT)
6	>A5.append(A4.cursor())

Calculation code:

	A	B
1	=now()
2	=file("p_brand.btx").import@b().(_1)
3	=file("p_container.btx").import@b().(_1)
4	=brand=A2.pos@b("Brand#33")	=container=A3.pos@b("LG DRUM")
5	=file("part_17_3.ctx").open()	=A5.cursor().skip().(false)
6	=A5.cursor@m(P_PARTKEY;P_BRAND==A4 && P_CONTAINER==B4).fetch().(B5(P_PARTKEY)=true)
7	=file("lineitem_4.ctx").open()
8	=A7.cursor@m(L_PARTKEY,L_QUANTITY;B5(L_PARTKEY))
9	=A8.groups@u(L_PARTKEY;avg(L_QUANTITY):avg).run(B5(L_PARTKEY)=avg*0.2)
10	=A7.cursor@m(L_PARTKEY,L_QUANTITY,L_EXTENDEDPRICE;L_PARTKEY:B5:#)
11	=A10.total(sum(if(L_QUANTITY<L_PARTKEY,L_EXTENDEDPRICE,0)))/7.0
12	=interval@ms(A1,now())

A6 is an alignment sequence, and its value is true or false.

A9 updates the value of the alignment sequence A6 to false or avg*0.2. Thus, in A10, the L_PARTKEY of the record that meets the condition will be assigned a value of avg*0.2, allowing it to participate in subsequent calculations in A11.

Test result:

Test items	Execution time (seconds)
General method	10
Data conversion	9

After data conversion, the speed is not improved obviously. The reason is that the filter condition of pre-cursor filtering technology used in previous section is a search calculation by primary key, the performance of which is not much different from that of the alignment sequence used in this section.

4. Column-wise computing

Calculation code:

	A	B
1	=now()
2	=file("p_brand.btx").import@b().(_1)
3	=file("p_container.btx").import@b().(_1)
4	=brand=A2.pos@b("Brand#33")	=container=A3.pos@b("LG DRUM")
5	=file("part_17_3.ctx").open()	=A5.cursor().skip().(false)
6	=A5.cursor@mv(P_PARTKEY;P_BRAND==A4 && P_CONTAINER==B4).fetch().(B5(P_PARTKEY)=true)
7	=file("lineitem_4.ctx").open()
8	=A7.cursor@mv(L_PARTKEY,L_QUANTITY;B5(L_PARTKEY))
9	=A8.groups@u(L_PARTKEY;avg(L_QUANTITY):avg).run(B5(L_PARTKEY)=avg*0.2)
10	=A7.cursor@m(L_PARTKEY,L_QUANTITY,L_EXTENDEDPRICE;B5(L_PARTKEY))
11	=A10.total(sum(if(L_QUANTITY<B5(L_PARTKEY),L_EXTENDEDPRICE,0)))/7.0
12	=interval@ms(A1,now())

Since the columnar cursor cannot use switch to assign values to fields, the filter condition in A10 is modified from L_PARTKEY:B5:# in the previous section as B5(L_PARTKEY), which can only implement filtering but not assignment. A11needstouseB5(L_PARTKEY) to calculate again.

Test result:

Test items	Execution time (seconds)
General method	10
Data conversion	9
Column-wise computing	6

SPL Official Website 👉 https://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL

SPL Learning Material 👉 https://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/2bkGwqTj

Youtube 👉 https://www.youtube.com/@esProc_SPL

SQL(10)

esProc

chenhu • 150 View • 2 Months ago

Performance optimization case course: TPCH-Q17

1. Data storage

2. General method

3. Data conversion

4. Column-wise computing

ToC