"Ⅰ SQL code and analysis Below is the SQL query statement: select * from ( select s_name, count(* .."

Jerry RaqForum 14 No.
284 View • 1 Years ago

Performance Optimization Exercises Using TPC-H – Q21

Ⅰ SQL code and analysis

Below is the SQL query statement:

select * from (
    select
        s_name,
        count(*) as numwait
    from
        supplier,lineitem l1,orders,nation
    where
        s_suppkey = l1.l_suppkey
        and o_orderkey = l1.l_orderkey
        and o_orderstatus = 'F'
        and l1.l_receiptdate > l1.l_commitdate
        and exists (
            select
                *
            from
                lineitem l2
            where
                l2.l_orderkey = l1.l_orderkey
                and l2.l_suppkey <> l1.l_suppkey
        )
        and not exists (
            select
                *
            from
                lineitem l3
            where
                l3.l_orderkey = l1.l_orderkey
                and l3.l_suppkey <> l1.l_suppkey
                and l3.l_receiptdate > l3.l_commitdate
        )
        and s_nationkey = n_nationkey
        and n_name = 'CHINA'
    group by
        s_name
    order by
        numwait desc,
        s_name
) where rownum<=100;

The main query is a simple grouping & aggregation operation on the result set of primary-sub table association, with a complex filtering condition containing two exists clauses.

Ⅱ SPL solution

Both exists clauses perform computation on lineitem records under same l_orderkey.

We know that lineitem table is already ordered by l_orderkey. Besides, records of the sub table can be regarded as a set type field of the primary table. If we perform order-based grouping (but without aggregation) on result set of the association between orders table and lineitem table on orderkey, we get a series of small sets of lineitem records having same l_orderkey value. Then it becomes simple to compute the two exists conditions on each small set.

	A
1	=now()
2	>name="CHINA"
3	=file("nation.btx").import@b().select@1(N_NAME==name).N_NATIONKEY
4	=file("supplier.ctx").open().cursor@m(S_SUPPKEY,S_NAME;S_NATIONKEY==A3).fetch().keys@im(S_SUPPKEY)
5	=file("orders.ctx").open().cursor@m(O_ORDERKEY;O_ORDERSTATUS=="F")
6	=file("lineitem.ctx").open().news@r(A5,L_ORDERKEY:ORDERKEY;A4.find(L_SUPPKEY))
7	=file("lineitem.ctx").open().news(A6:ORDERKEY,L_ORDERKEY,L_SUPPKEY,L_RECEIPTDATE,L_COMMITDATE)
8	=A7.group(L_ORDERKEY)
9	=A8.conj(if((t=~.select(L_RECEIPTDATE>L_COMMITDATE)) && (s=t.#2) && !t.pselect(#2!=s) && ~.pselect(#2!=s),t,null) )
10	=A9.switch@i(L_SUPPKEY,A4)
11	=A10.groups@u(L_SUPPKEY.S_NAME:s_name;count(1):numwait)
12	=A11.top(100;-numwait,s_name)
13	=interval@ms(A1,now())

A6 association orders table and lineitem table, filters association result and selects eligible ORDERKEY values. A7 associates with A6 to select the desired fields. A8 performs order-based grouping on A7 to divide it into sets of lineitem records having same l_orderkey values. A9 judges each of these small sets according to exists conditions and filters away the ineligible records. The rest of the code associates with the other foreign key table and performs the regular grouping operation on A10.

Ⅲ Further optimization

1. Optimization method

In this example, we will use date-integer conversion optimization method explained in Q1 – lineitem table’s L_COMMITDATE field and L_RECEIPTDATE field have been converted in the previous examples, string-integer conversion method explained in the same essay – here orders table’s O_ORDERSTATUS field needs to be converted, and dimension table primary key numberization method explained in Q2 – supplier table’s S_SUPPKEY field and lineitem table’s L_SUPPKEY field are already converted in the previous examples.

2. Code for data conversion

2.1 Conversion on nation table, supplier table and lineitem table

Copy nation_20.btx, supplier_20.ctx and lineitem_20.ctx, and rename them nation_21.btx, supplier_21.ctx and lineitem_21.ctx respectively.

2.2 Conversion on orders table

	A
1	=file("orders.ctx").open().cursor(O_ORDERSTATUS)
2	=A1.id(O_ORDERSTATUS).sort()
3	=file("o_orderstatus.txt").export(A2)
4	=file("orders_18.ctx").open().cursor()
5	=A4.run(O_ORDERSTATUS=A2.pos@b(O_ORDERSTATUS))
6	=file("orders_21.ctx").create(#O_ORDERKEY,O_CUSTKEY,O_ORDERSTATUS,O_TOTALPRICE,O_ORDERDATE,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_COMMENT)
7	>A6.append(A5)

3. Code after data conversion

The dimension tables and the value list of string fields need to be preloaded. Below is preloading code:

	A
1	>env(o_orderstatus,file("o_orderstatus.txt").import@si())
2	>env(nation, file("nation_21.btx").import@b())
3	>env(supplier, file("supplier_21.ctx").open().import())

Before performing the query, we need to first run the preloading code to load small dimension tables into memory.

Computing code:

	A
1	=now()
2	>name="CHINA"
3	=o_orderstatus.pos@b("F")
4	=nation.select@1(N_NAME==name).N_NATIONKEY
5	=supplier.@m(if(S_NATIONKEY==A4,S_NAME,null))
6	=file("orders_21.ctx").open().cursor@m(O_ORDERKEY;O_ORDERSTATUS==A3)
7	=file("lineitem_21.ctx").open().news@r(A6,L_ORDERKEY:ORDERKEY;A5(L_SUPPKEY))
8	=file("lineitem_21.ctx").open().news(A7:ORDERKEY,L_ORDERKEY,L_SUPPKEY,L_COMMITDATE,L_RECEIPTDATE)
9	=A8.group(L_ORDERKEY)
10	=A9.conj(if((t=~.select(L_RECEIPTDATE>L_COMMITDATE)) && (s=t.#2) && !t.pselect(#2!=s) && ~.pselect(#2!=s),t.select(A5(L_SUPPKEY)),null) )
11	=A10.groups@u(L_SUPPKEY:s_name;count(1):numwait).run@m(s_name=A5(s_name))
12	=A11.top(100;-numwait,s_name)
13	=interval@ms(A1,now())

Ⅳ Using enterprise edition’s column-wise computation

1. Original data

	A
1	=now()
2	>name="CHINA"
3	=file("nation.btx").import@b().select@1(N_NAME==name).N_NATIONKEY
4	=file("supplier.ctx").open().cursor@mv(S_SUPPKEY,S_NAME;S_NATIONKEY==A3).fetch().keys@im(S_SUPPKEY)
5	=file("orders.ctx").open().cursor@mv(O_ORDERKEY;O_ORDERSTATUS=="F")
6	=file("lineitem.ctx").open().news@r(A5,L_ORDERKEY:ORDERKEY,min(L_SUPPKEY):SK,count(1):cnt; A4.find(L_SUPPKEY) && L_RECEIPTDATE>L_COMMITDATE)
7	=file("lineitem.ctx").open().news@r(A6:ORDERKEY,L_ORDERKEY,SK,cnt,count(L_RECEIPTDATE>L_COMMITDATE && SK!=L_SUPPKEY):c1,count(SK!=L_SUPPKEY):c2).select@v(c1==0 && c2!=0)
8	=A7.groups@u(SK;sum(cnt):numwait).join(SK,A4,S_NAME).new(S_NAME:s_name,numwait)
9	=A8.top(100;-numwait,s_name)
10	=interval@ms(A1,now())

2. Optimized data

The dimension tables and the value list of string fields need to be preloaded. Below is preloading code:

	A
1	>env(o_orderstatus,file("o_orderstatus.txt").import@si())
2	>env(nation, file("nation_21.btx").import@bv())
3	>env(supplier, file("supplier_21.ctx").open().import@v())

Before performing the query, we need to first run the preloading code to load small dimension tables into memory.

Computing code:

	A
1	=now()
2	>name="CHINA"
3	=o_orderstatus.pos@b("F")
4	=nation.select@1(N_NAME==name).N_NATIONKEY
5	=supplier.(if(S_NATIONKEY==A4,S_NAME,null))
6	=file("orders_21.ctx").open().cursor@mv(O_ORDERKEY;O_ORDERSTATUS==A3)
7	=file("lineitem_21.ctx").open().news@r(A6,L_ORDERKEY:ORDERKEY,min(L_SUPPKEY):SK,count(1):cnt; A5(L_SUPPKEY) && L_RECEIPTDATE>L_COMMITDATE)
8	=file("lineitem_21.ctx").open().news@r(A7:ORDERKEY,L_ORDERKEY,SK,cnt,count(L_RECEIPTDATE>L_COMMITDATE && SK!=L_SUPPKEY):c1,count(SK!=L_SUPPKEY):c2).select@v(c1==0 && c2!=0)
9	=A8.groups@u(SK;sum(cnt):numwait).new(A5(SK):s_name,numwait)
10	=A9.top(100;-numwait,s_name)
11	=interval@ms(A1,now())

Ⅴ Test result

Unit: Second

	Regular	Column-wise
Before optimization	19.9	14.0
After optimization	14.3	9.4

SPL Official Website 👉 http://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProc

SPL Learning Material 👉 http://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/ydhVnFH9

Youtube 👉 https://www.youtube.com/@esProc_SPL

esProc

Jerry • 284 View • 1 Years ago