"Ⅰ SQL code and analysis Below is the SQL query statement: select * from ( select c_name, c_custk .."

Jerry RaqForum 14 No.
326 View • 1 Years ago

Performance Optimization Exercises Using TPC-H – Q18

Ⅰ SQL code and analysis

Below is the SQL query statement:

select * from (
    select
        c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice,
        sum(l_quantity)
    from
        customer, orders, lineitem
    where
        o_orderkey in (
            select
                l_orderkey
            from
                lineitem
            group by
                l_orderkey
            having
                sum(l_quantity) > 314
        )
        and c_custkey = o_custkey
        and o_orderkey = l_orderkey
    group by
        c_name,
        c_custkey,
        o_orderkey,
        o_orderdate,
        o_totalprice
    order by
        o_totalprice desc,
        o_orderdate
) where rownum<=100;

Let’s examine the SQL statement. If we name the following subquery

select 
    l_orderkey, sum(l_quantity) lq
from 
    lineitem
group by 
    l_orderkey
View lo, the body of the original query statement can be written as:
select * from (
    select
        c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, lq
    from
        customer, orders, lo
    where
        c_custkey = o_custkey
        and o_orderkey = l_orderkey
        and lq > 314
    order by
        o_totalprice desc,
        o_orderdate
) where rownum<=100;

Now the computation on the view is a single-table grouping & aggregation operation. The main query is a multi-table association and filtering.

Ⅱ SPL solution

This is a join between orders table having a foreign key association and its homo-dimension table lo. The lineitem table, as orders table’s sub table, is also ordered by l_orderkey, and view lo, which is computed from lineitem table, should be still ordered by l_orderkey. So, we can perform a fast merge join between orders and lo.

Rather than making customer table participate in the computation from the beginning, we can associate the foreign key table with orders table based on the result set.

	A
1	=now()
2	>quantity=314
3	=file("lineitem.ctx").open().cursor@m(L_ORDERKEY,L_QUANTITY)
4	=A3.group@s(L_ORDERKEY;sum(L_QUANTITY):quantities).select(quantities>quantity)
5	=file("orders.ctx").open().new(A4,O_ORDERKEY,O_CUSTKEY,O_TOTALPRICE,O_ORDERDATE,quantities)
6	=A5.total(top(100;-O_TOTALPRICE,O_ORDERDATE))
7	=file("customer.ctx").open()
8	=A6.joinx@q(O_CUSTKEY,A7:C_CUSTKEY,C_NAME)
9	return interval@ms(A1,now())

A5 uses the order-based primary-sub table merge technique explained in the previous examples. A8 uses joinx@q to associate orders table and customer table in order.

Ⅲ Further optimization

1. Optimization method

In this example, we will use the dimension table primary key numberization method explained in Q2 – customer table’s C_CUSTKEY field and orders table’s O_CUSTKEY field have been converted in the previous examples.

2. Code for data conversion

Copy customer_13.ctx, orders_13.ctx and lineitem_17.ctx and rename them customer_18.ctx, orders_18.ctx and lineitem_18.ctx respectively.

3. Code after data conversion

First, we need to preload the dimension tables. Below is preloading code:

	A
1	>env(customer, file("customer_18.ctx").open().import())

Before performing the query, we need to first run the preloading code to load the small dimension tables into memory.

Computing code:

	A
1	=now()
2	>quantity=314
3	=file("lineitem_18.ctx").open().cursor@m(L_ORDERKEY,L_QUANTITY)
4	=A3.group@s(L_ORDERKEY;sum(L_QUANTITY):quantities).select(quantities>quantity)
5	=file("orders_18.ctx").open().new(A4,O_ORDERKEY,O_CUSTKEY,O_TOTALPRICE,O_ORDERDATE,quantities)
6	=A5.total(top(100;-O_TOTALPRICE,O_ORDERDATE))
7	O_ORDERDATE=date@o(O_ORDERDATE))=A6.derive(customer(O_CUSTKEY).C_NAME:C_NAME).run(
8	=interval@ms(A1,now())

The O_ORDERDATE field in orders_18.ctx has been converted to an integer type one in the previous example, so we need to use date@o to restore it to a date type one.

Ⅳ Using enterprise edition’s column-wise computation

1. Original data

	A
1	=now()
2	>quantity=314
3	=file("lineitem.ctx").open().cursor@mv(L_ORDERKEY,L_QUANTITY)
4	=A3.group@s(L_ORDERKEY;sum(L_QUANTITY):quantities).select@v(quantities>quantity)
5	=file("orders.ctx").open().new(A4,O_ORDERKEY,O_CUSTKEY,O_TOTALPRICE,O_ORDERDATE,quantities)
6	=A5.total(top(100;-O_TOTALPRICE,O_ORDERDATE))
7	=file("customer.ctx").open()
8	=A6.joinx@q(O_CUSTKEY,A7:C_CUSTKEY,C_NAME)
9	=interval@ms(A1,now())

2. Optimized data

First, we need to preload the dimension tables. Below is preloading code:

	A
1	>env(customer, file("customer_18.ctx").open().import@v())

Before performing the query, we need to first run the preloading code to load the small dimension tables into memory.

Computing code:

	A
1	=now()
2	>quantity=314
3	=file("lineitem_18.ctx").open().cursor@mv(L_ORDERKEY,L_QUANTITY)
4	=A3.group@s(L_ORDERKEY;sum(L_QUANTITY):quantities).select@v(quantities>quantity)
5	=file("orders_18.ctx").open().new(A4,O_ORDERKEY,O_CUSTKEY,O_TOTALPRICE,O_ORDERDATE,quantities)
6	=A5.total(top(100;-O_TOTALPRICE,O_ORDERDATE))
7	=A6.derive(customer(O_CUSTKEY).C_NAME:C_NAME).run(O_ORDERDATE=date@o(O_ORDERDATE))
8	=interval@ms(A1,now())

Ⅴ Test result

Unit: Second

	Regular	Column-wise
Before optimization	13.7	5.0
After optimization	13.7	4.9

The data conversion and preloading does not bring noticeable performance increase because the main computing time is spent in A3-A6, which the optimization methods do not work. The result set on which the optimization methods are employed has only 100 rows of data, which are too few to have a noticeable effect.

SPL Official Website 👉 https://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL

SPL Learning Material 👉 https://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/cFTcUNs7

Youtube 👉 https://www.youtube.com/@esProc_SPL

esProc

Jerry • 326 View • 1 Years ago