"select * from ( select s_name, count(*) as numwait from supplier,lineitem l1,orders,nation where .."

chenhu RaqForum 81 No.
1 Reply • 76 View • 1 Months ago

Performance optimization case course: TPCH-Q21

SPL(24)

select * from (
    select
        s_name,
        count(*) as numwait
    from
        supplier,lineitem l1,orders,nation
    where
        s_suppkey = l1.l_suppkey
        and o_orderkey = l1.l_orderkey
        and o_orderstatus = 'F'
        and l1.l_receiptdate > l1.l_commitdate
        and exists (
            select
                *
            from
                lineitem l2
            where
                l2.l_orderkey = l1.l_orderkey
                and l2.l_suppkey <> l1.l_suppkey
        )
        and not exists (
            select
                *
            from
                lineitem l3
            where
                l3.l_orderkey = l1.l_orderkey
                and l3.l_suppkey <> l1.l_suppkey
                and l3.l_receiptdate > l3.l_commitdate
        )
        and s_nationkey = n_nationkey
        and n_name = 'CHINA'
    group by
        s_name
    order by
        numwait desc,
        s_name
) where rownum<=100;

The main query is a grouping & aggregation operation on the result set of joining between the primary-sub table and the foreign key table. The filtering condition is relatively complex and contains two exists subqueries.

1. Data storage

Both exists sub-queries perform computation on lineitem records under the same L_ORDERKEY.

Store the orders table and lineitem table in order by the primary key ORDERKEY. When associating the two tables, perform order-based merge by ORDERKEY, and perform order-based grouping but without aggregation on the result set of association and keep the grouped subsets. The grouped subsets are small sets composed of lineitem records with the same L_ORDERKEY value. It will be simpler to calculate the two exists conditions mentioned above based on these small sets.

Continue to use orders.ctx and lineitem.ctx from Q3 query and supplier.ctx and nation.btx from Q2 query. Copy these tables to the main directory of this query.

2. General method

Calculation code:

	A
1	=now()
2	>name="CHINA"
3	=file("nation.btx").import@b().select@1(N_NAME==name).N_NATIONKEY
4	=file("supplier.ctx").open().cursor@m(S_SUPPKEY,S_NAME;S_NATIONKEY==A3).fetch().keys@im(S_SUPPKEY)
5	=file("orders.ctx").open().cursor@m(O_ORDERKEY;O_ORDERSTATUS=="F")
6	=(lineitemFile=file("lineitem.ctx").open()).news@r(A5,L_ORDERKEY:ORDERKEY;A4.find(L_SUPPKEY))
7	=lineitemFile.news(A6:ORDERKEY,L_ORDERKEY,L_SUPPKEY,L_RECEIPTDATE,L_COMMITDATE)
8	=A7.group(L_ORDERKEY)
9	=A8.conj(if((t=~.select(L_RECEIPTDATE>L_COMMITDATE)) && (s=t.L_SUPPKEY) && !t.pselect(L_SUPPKEY!=s) && ~.pselect(L_SUPPKEY!=s),t,null) )
10	=A9.switch@i(L_SUPPKEY,A4)
11	=A10.groups@u(L_SUPPKEY.S_NAME:s_name;count(1):numwait)
12	=A11.top(100;-numwait,s_name)
13	=interval@ms(A1,now())

A6 associates the sub-table lineitem with the primary table orders, filters the association result and selects the ORDERKEY values satisfying the L_SUPPKEY condition.

When multiple records of lineitem correspond to one record of orders, news@r will not copy the record of orders.

Since opening a file takes time, the opened file here is assigned to the variable lineitemFile to avoid reopening it in A7.

A7 associates lineitem with A6 to select the desired fields. When multiple records of lineitem correspond to one record of orders, news will copy the record of orders.

A8 performs order-based grouping on A7. The grouped subsets are small sets composed of lineitem records with the same L_ORDERKEY value.

A9 performs calculations on these small sets, which is equivalent to implementing two exists condition judgments in SQL.

The exists subquery in SQL is equivalent to requiring that all L_SUPPKEY values in a small set cannot be the same, while the not exists subquery requires that all L_SUPPKEY values in a subset after filtering the small set by the condition L_RECEIPTDATE > L_COMMITDATE must be the same.

To determine if all L_SUPPKEY values in a set are the same, just determine whether the L_SUPPKEY value of set member is the same as that of a specified member. The s in A9 is the specified member.

The rest of the code associates with other foreign key tables and performs regular grouping operation.

Test result:

Test items	Execution time (seconds)
General method	21

3. Data conversion

For this query, we need to use two optimization methods mentioned in previous articles: dimension table primary key sequence-numberization and date-integer conversion. Copy supplier_2.ctx from Q2 and lineitem_3.ctx from Q12 to the main directory of this query.

Based on orders.ctx in Q3, convert the enumeration string field O_ORDERSTATUS to numbers.

Code for data conversion:

	A
1	=file("orders.ctx").open()
2	=A1.cursor(O_ORDERSTATUS)
3	=A2.id(O_ORDERSTATUS).sort()
4	=file("o_orderstatus.btx").export@b(A3)
5	=A1.cursor().run(O_ORDERSTATUS=A3.pos@b(O_ORDERSTATUS))
6	=file("orders_21_3.ctx").create(#O_ORDERKEY,O_CUSTKEY,O_ORDERSTATUS,O_TOTALPRICE,O_ORDERDATE,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_COMMENT)
7	>A6.append(A5)

Calculation code:

	A	B
1	=now()
2	>name="CHINA"
3	=file("o_orderstatus.btx").import@b().(_1).(~=="F")
4	=file("nation_2.btx").import@b().(if(N_NAME==name,N_NATIONKEY,null))
5	=file("supplier_2.ctx").open()	=A5.cursor@m().skip().(null)
6	=A5.cursor@m(S_SUPPKEY,S_NAME;A4(S_NATIONKEY)).fetch().(B5(S_SUPPKEY)=S_NAME)
7	=file("orders_21_3.ctx").open().cursor@m(O_ORDERKEY;A3(O_ORDERSTATUS))
8	=(lineitemFile=file("lineitem_3.ctx").open()).news@r(A7,L_ORDERKEY:ORDERKEY;B5(L_SUPPKEY))
9	=lineitemFile.news(A8:ORDERKEY,L_ORDERKEY,L_SUPPKEY,L_COMMITDATE,L_RECEIPTDATE)
10	=A9.group(L_ORDERKEY)
11	=A10.conj(if((t=~.select(L_RECEIPTDATE>L_COMMITDATE)) && (s=t.#2) && !t.pselect(#2!=s) && ~.pselect(#2!=s),t.select(B5(L_SUPPKEY)),null) )
12	=A11.groups@u(L_SUPPKEY:s_name;count(1):numwait).run@m(s_name=B5(s_name))
13	=A12.top(100;-numwait,s_name)
14	=interval@ms(A1,now())

A3, A4 and B5 are aligned sequences.

Test result:

Test items	Execution time (seconds)
General method	21
Data conversion	16

4. Column-wise computing

Calculation code:

	A	B
1	=now()
2	>name="CHINA"
3	=file("o_orderstatus.btx").import@b().(_1).(~=="F")
4	=file("nation_2.btx").import@b().(if(N_NAME==name,N_NATIONKEY,null))
5	=file("supplier_2.ctx").open()	=A5.cursor@m().skip().(null)
6	=A5.cursor@m(S_SUPPKEY,S_NAME;A4(S_NATIONKEY)).fetch().(B5(S_SUPPKEY)=S_NAME)
7	=file("orders_21_3.ctx").open().cursor@mv(O_ORDERKEY;A3(O_ORDERSTATUS))
8	=(lineitemFile=file("lineitem_3.ctx").open()).news@r(A7,L_ORDERKEY:ORDERKEY,min(L_SUPPKEY):SK,count(1):cnt; B5(L_SUPPKEY) && L_RECEIPTDATE>L_COMMITDATE)
9	=lineitemFile.news@r(A8:ORDERKEY,L_ORDERKEY,SK,cnt,count(L_RECEIPTDATE>L_COMMITDATE && SK!=L_SUPPKEY):c1,count(SK!=L_SUPPKEY):c2).select@v(c1==0 && c2!=0)
10	=A9.groups@u(SK;sum(cnt):numwait).new(B5(SK):s_name,numwait)
11	=A10.top(100;-numwait,s_name)
12	=interval@ms(A1,now())

Since column-wise computing cannot use temporary variables, A8 defines the
computed columns SK and cnt.

A9 compares the L_SUPPKEY values of the records in the grouped subset with SK and directly summarize the computed columns c1 and c2. When multiple records of lineitem correspond to one record of orders, news@r will directly aggregate the lineitem records without copying the record of orders. Then, utilize the select function to filter the conditions of c1 and c2.

Test result:

Test items	Execution time (seconds)
General method	21
Data conversion	16
Column-wise computing	8

SPL Official Website 👉 https://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL

SPL Learning Material 👉 https://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/2bkGwqTj

Youtube 👉 https://www.youtube.com/@esProc_SPL

SPL(24)

esProc

chenhu • 76 View • 1 Months ago

Performance optimization case course: TPCH-Q21

1. Data storage

2. General method

3. Data conversion

4. Column-wise computing

ToC