"select 100.00 * sum( case when p_type like 'PROMO%' then l_extendedprice * (1 - l_discount) else .."

chenhu RaqForum 81 No.
1 Reply • 167 View • 2 Months ago

Performance optimization case course: TPCH-Q14

SQL(10)

select
    100.00 * sum(
        case when p_type like 'PROMO%' then l_extendedprice * (1 - l_discount)
            else 0 end)
        / sum(l_extendedprice * (1 - l_discount) ) as promo_revenue
from
    lineitem,
    part
where
    l_partkey = p_partkey
    and l_shipdate >= date '1995-04-01'
and l_shipdate < date '1995-04-01' + interval '1' month;

This is an aggregate operation on the filtered result set of two-table association.

1. Data storage

There is no special requirement for tables lineitem and part, they are stored in order by primary key.

We can continue to use lineitem.ctx and part.ctx from Q3.

Copy these tables to the main directory of this query.

2. General method

This is a regular association-based sum calculation. We can handle it by making full use of the parallel processing.

	A
1	=now()
2	1995-4-1
3	=elapse@m(A2,1)
4	=file("part.ctx").open().cursor@m(P_PARTKEY,P_TYPE).fetch().keys@i(P_PARTKEY)
5	=file("lineitem.ctx").open().cursor@m(L_PARTKEY,L_EXTENDEDPRICE,L_DISCOUNT;L_SHIPDATE>=A2 &&L_SHIPDATE<A3,L_PARTKEY:A4)
6	=A5.run(L_EXTENDEDPRICE*=(1-L_DISCOUNT),L_DISCOUNT=if(pos@h(L_PARTKEY.P_TYPE,"PROMO"),L_EXTENDEDPRICE,0))
7	=A6.total(sum(L_DISCOUNT),sum(L_EXTENDEDPRICE))
8	=100.00*A7(1)/A7(2)
9	=interval@ms(A1,now())

The code utilizes optimization methods mentioned in previous articles, such as dimension table attributization, multi-thread parallel processing.

Test result:

Test items	Execution time (seconds)
General method	103

Since the part table contains many records, and p_type is a long string field, it consumes a large amount of memory space when loading them all into memory. After loading the part into memory and joining it with lineitem, the number of string comparisons is roughly the same as the number of records in lineitem. Because the number of records in lineitem is very large, the performance is poor.

3. Calculate dimension table in advance

The string comparison in the previous section is performed after joining two tables. In fact, this is a calculation on dimension table field, which can be done in advance. Coding in SQL is roughly like:

select
    100.00 * sum(
        case when p_type then l_extendedprice * (1 - l_discount)
            else 0 end)
        / sum(l_extendedprice * (1 - l_discount) ) as promo_revenue
from
    lineitem,
    (select
        p_partkey,
        (case when p_type like 'PROMO%' then true else false end) as p_type
     from
        part
   ) p
where
    l_partkey = p_partkey
    and l_shipdate >= date '1995-04-01'
and l_shipdate < date '1995-04-01' + interval '1' month;

Calculation code:

	A
1	=now()
2	1995-4-1
3	=elapse@m(A2,1)
4	=file("part.ctx").open().cursor@m(P_PARTKEY,P_TYPE).run(P_TYPE=pos@h(P_TYPE,"PROMO")).fetch().keys@i(P_PARTKEY)
5	=file("lineitem.ctx").open().cursor@m(L_PARTKEY,L_EXTENDEDPRICE,L_DISCOUNT;L_SHIPDATE>=A2 &&L_SHIPDATE<A3,L_PARTKEY:A4)
6	=A5.run(L_EXTENDEDPRICE*=(1-L_DISCOUNT),L_DISCOUNT=if(L_PARTKEY.P_TYPE,L_EXTENDEDPRICE,0))
7	=A6.total(sum(L_DISCOUNT),sum(L_EXTENDEDPRICE))
8	=100.00*A7(1)/A7(2)
9	=interval@ms(A1,now())

A4 performs string comparison calculation, and the number of calculations is the same as the number of records in part, but much smaller than the number of records in lineitem. What is finally read into memory is 1 or null, occupyingmuchlessmemory than string.

There is no need to do string comparison calculation in A5.

Test result:

Test items	Execution time (seconds)
General method	103
Calculate dimension table in advance	8

4. Data conversion

Utilize the methods mentioned in previous articles: convert date to integer, convert primary key of dimension table to sequence number.

We can directly use part_2.ctx and p_type.btx from Q2.

Copy these tables to the main directory of this query.

The lineitem_4.ctx converted in Q8 has implemented sequence-numberization of l_partkey, but did not implement integerization of the date l_shipdate, so it needs to be converted here.

Code for data conversion:

	A
1	=file("lineitem_4.ctx").open().cursor().run(L_SHIPDATE=days@o(L_SHIPDATE))
2	=file("lineitem_14_4.ctx").create(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS,L_SHIPDATE,L_COMMITDATE,L_RECEIPTDATE,L_SHIPINSTRUCT,L_SHIPMODE,L_COMMENT)
3	>A2.append(A1)

Calculation code:

	A	B
1	=now()
2	1995-4-1	PROMO
3	=days@o(elapse@m(A2,1))	=days@o(A2)
4	=p_type=file("p_type.btx").import@b().(pos@h(_1,B2))
5	=file("part_2.ctx").open()	=A5.cursor@m().skip().(false)
6	=A5.cursor@m(P_PARTKEY;A4(P_TYPE)).fetch().(B5(P_PARTKEY)=true)
7	=file("lineitem_14_4.ctx").open().cursor@m(L_PARTKEY,L_EXTENDEDPRICE,L_DISCOUNT;L_SHIPDATE>=B3 && L_SHIPDATE<A3)
8	=A7.run(L_EXTENDEDPRICE*=(1-L_DISCOUNT),L_DISCOUNT=if(B5(L_PARTKEY),L_EXTENDEDPRICE,0))
9	=A8.total(sum(L_DISCOUNT),sum(L_EXTENDEDPRICE))
10	=100.00*A9(1)/A9(2)

A4 and B5 are alignment sequence.

The p_type field of the part table has been converted to integer. A6 will no longer retrieve the string value of the original p_type field, but the p_partkey integer value that meet the condition of the new p_type field. As a result, both the reading of external storage and the memory usage are greatly decreased, and performance is significantly improved.

Test result:

Test items	Execution time (seconds)
General method	103
Calculate dimension table in advance	8
Data conversion	6

5. Column-wise computing

	A	B
1	=now()
2	1995-4-1	PROMO
3	=days@o(elapse@m(A2,1))	=days@o(A2)
4	=p_type=file("p_type.btx").import@b().(pos@h(_1,B2))
5	=file("part_2.ctx").open()	=A5.cursor@m().skip().(false)
6	=A5.cursor@m(P_PARTKEY;A4(P_TYPE)).fetch().(B5(P_PARTKEY)=true)
7	=file("lineitem_3.ctx").open().cursor@mv(L_PARTKEY,L_EXTENDEDPRICE,L_DISCOUNT;L_SHIPDATE>=B3 && L_SHIPDATE<A3)
8	=A7.derive@o(L_EXTENDEDPRICE*(1-L_DISCOUNT):dp,if(B5(L_PARTKEY),dp,0):dp1)
9	=A8.total(sum(dp1),sum(dp))
10	=100.00*A9(1)/A9(2)

Column-wise computing cannot assign value to field, so A8 changes the assignment statement in the previous section to new fields dp and dp1.

Test result:

Test items	Execution time (seconds)
General method	103
Calculate dimension table in advance	8
Data conversion	6
Column-wise computing	3

SPL Official Website 👉 https://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL

SPL Learning Material 👉 https://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/2bkGwqTj

Youtube 👉 https://www.youtube.com/@esProc_SPL

SQL(10)

esProc

chenhu • 167 View • 2 Months ago

Performance optimization case course: TPCH-Q14

1. Data storage

2. General method

3. Calculate dimension table in advance

4. Data conversion

5. Column-wise computing

ToC