Performance Optimization Exercises Using TPC-H – Q22

 

SQL code and analysis

Below is the SQL query statement:

select
    cntrycode,
    count(*) as numcust,
    sum(c_acctbal) as totacctbal
from (
    select
        substr(c_phone,1,2) as cntrycode,
        c_acctbal
    from
        customer
    where
        substr(c_phone,1,2) in ('11', '14', '15', '19', '20', '21', '23')
        and c_acctbal > (
            select
                avg(c_acctbal)
            from
                customer
            where
                c_acctbal > 0.00
                and substr(c_phone,1,2) in    ('11', '14', '15', '19', '20', '21', '23')
        )
        and not exists (
            select
                *
            from
                orders
            where
                o_custkey = c_custkey
        )
    ) custsale
group by
    cntrycode
order by
    cntrycode;

The main query’s outer layer is a regular grouping and aggregation operation, and its inner layer contains two conditional subqueries.

SPL solution

The two subqueries, where the second one is preceded by not exists, correspond to certain customer records meeting their respective condition. The final result should be the difference set of customer record sets corresponding to the two subqueries.


A

1

=now()

2

=["11","14","15","19","20","21","23"]

3

=file("customer.ctx").open().cursor@m(C_CUSTKEY,C_PHONE,C_ACCTBAL;C_ACCTBAL>0.0 && A2.contain(left(C_PHONE,2))).fetch()

4

=A3.avg(C_ACCTBAL)

5

=A3.select@m(C_ACCTBAL>A4).derive@o().keys@im(C_CUSTKEY)

6

=file("orders.ctx").open().cursor@m(O_CUSTKEY;O_CUSTKEY:A5)

7

=A6.run(O_CUSTKEY.C_CUSTKEY=null).skip()

8

=A5.select@m(C_CUSTKEY)

9

=A8.groups(left(C_PHONE,2):cntrycode;count(1):numcust, sum(C_ACCTBAL):totacctbal)

10

=interval@ms(A1,now())

A5 selects records meeting condition specified in the first subquery. A6 associates with A5 and selects target records for the second subquery. A7 resets C_CUSTKEY values of all these selected records as null, which means deleting them from A5, in order to achieve the not exists condition on the second subquery. A8 selects records satisfying the two conditional subqueries.

Further optimization

1. Optimization method

In this example, we will use dimension table primary key numberization method explained in Q2 – customer table’s C_CUSTKEY field and orders table’s O_CUSTKEY field have been converted in the previous examples.

2. Code for data conversion

Copy customer_18.ctx and orders_21.ctx, and rename them customer_22.ctx and orders_22.ctx respectively.

3. Code after data conversion

First, we need to preload the dimension table. Below is preloading code:


A

1

>env(customer, file("customer_22.ctx").open().import())

Before performing the query, we need to first run the preloading code to load the small dimension table into memory.

Computing code:


A

1

=now()

2

=["11","14","15","19","20","21","23"]

3

=customer.@m(if(C_ACCTBAL>0.0 && A2.contain(left(C_PHONE,2)),C_ACCTBAL,null))

4

=A3.avg()

5

=A3.@m(~>A4)

6

=file("orders_22.ctx").open().cursor@m(O_CUSTKEY;A5(O_CUSTKEY)).run(A5(O_CUSTKEY)=false).skip()

7

=customer(A5.pselect@am(~)).groups@m(left(C_PHONE,2):cntrycode;count(1):numcust, sum(C_ACCTBAL):totacctbal)

8

=interval@ms(A1,now())

Using enterprise edition’s column-wise computation

1. Original data


A

1

=now()

2

=["11","14","15","19","20","21","23"]

3

=file("customer.ctx").open().cursor@mv(C_CUSTKEY,C_PHONE,C_ACCTBAL;C_ACCTBAL>0.0 && A2.contain(left(C_PHONE,2))).fetch()

4

=A3.avg(C_ACCTBAL)

5

=A3.select@mv(C_ACCTBAL>A4).derive@o().keys@im(C_CUSTKEY)

6

=file("orders.ctx").open().cursor@mv(O_CUSTKEY;O_CUSTKEY:A5)

7

=A6.run(O_CUSTKEY.C_CUSTKEY=null).skip()

8

=A5.select@mv(C_CUSTKEY)

9

=A8.groups@m(left(C_PHONE,2):cntrycode;count(1):numcust, sum(C_ACCTBAL):totacctbal)

10

=interval@ms(A1,now())

2. Optimized data

First, we need to preload the dimension table. Below is preloading code:


A

1

>env(customer, file("customer_22.ctx").open().import@v())

Before performing the query, we need to first run the preloading code to load the small dimension table into memory.

Computing code:


A

1

=now()

2

=["11","14","15","19","20","21","23"]

3

=customer.@m(if(C_ACCTBAL>0.0 && A2.contain(left(C_PHONE,2)),C_ACCTBAL,null))

4

=A3.avg()

5

=A3.@m(~>A4)

6

=file("orders_22.ctx").open().cursor@mv(O_CUSTKEY;A5(O_CUSTKEY)).run(A5(O_CUSTKEY)=false).skip()

7

=customer(A5.pselect@am(~)).groups@m(left(C_PHONE,2):cntrycode;count(1):numcust, sum(C_ACCTBAL):totacctbal)

8

=interval@ms(A1,now())

Test result

Unit: Second


Regular

Column-wise

Before optimization

7.5

4.2

After optimization

3.9

2.2