Performance Optimization Exercises Using TPC-H – Q2

 

SQL code andanalysis

Below is the SQL query statement:

select * from (
    select
        s_acctbal,s_name,n_name,p_partkey,p_mfgr,s_address,s_phone,s_comment
    from part,supplier,partsupp,nation,region
    where
        p_partkey = ps_partkey
        and s_suppkey = ps_suppkey
        and p_size = 25
        and p_type like '%COPPER'
        and s_nationkey = n_nationkey
        and n_regionkey = r_regionkey
        and r_name = 'ASIA'
        and ps_supplycost = (
            select
                min(ps_supplycost)
            from
                partsupp,
                supplier,
                nation,
                region
            where
                p_partkey = ps_partkey
                and s_suppkey = ps_suppkey
                and s_nationkey = n_nationkey
                and n_regionkey = r_regionkey
                and r_name = 'ASIA'
        )
    order by
        s_acctbal desc,n_name,s_name,p_partkey
)
where rownum <= 100;

Let’s examine the SQL statement. If we regard the following subquery

select
    *
from
    part,partsupp,supplier,nation,region
where
    p_partkey = ps_partkey
    and s_suppkey = ps_suppkey
    and s_nationkey = n_nationkey
    and n_regionkey = r_regionkey
    and r_name = 'ASIA'
    and p_size = 25
    and p_type like '%COPPER'

as a view V, the body of the original query statement can be written as:

select
    s_acctbal,s_name,n_name,p_partkey,p_mfgr,s_address,s_phone,s_comment
from V
where
    ps_supplycost = (
        select
            min(ps_supplycost)
        from
            V V1
        where
            V.p_partkey = V1.p_partkey
    )

Thus, the query becomes one on a single table. This amounts to search V for a certain record whose ps_supplycost values is the smallest among all records whose partkey values are the same as that of this record. The essence of the computation is grouping V by partkey, performing aggregation on each group, and getting the record whose ps_supplycost value is the smallest from each group. SQL does not support aggregation in this way, it can only write the operation using a subquery.

SPLsolution

SPL has set data type and reference data type to be able to directly perform aggregation and get the record having the smallest value.


A

1

=now()

2

>size=25

3

>type="COPPER"

4

>name="ASIA"

5

=file("region.btx").import@b().select@1(R_NAME==name).R_REGIONKEY

6

=file("nation.btx").import@b().select(N_REGIONKEY==A5).derive@o().keys(N_NATIONKEY)

7

=file("part.ctx").open().cursor@m(P_PARTKEY,P_MFGR;P_SIZE==size && pos@t(P_TYPE,type)).fetch().keys@im(P_PARTKEY)

8

=file("supplier.ctx").open().cursor@m(S_SUPPKEY,S_NAME,S_ADDRESS,S_NATIONKEY,S_PHONE,S_ACCTBAL,S_COMMENT;S_NATIONKEY:A6).fetch().keys@im(S_SUPPKEY)

9

=file("partsupp.ctx").open().cursor@m(PS_PARTKEY,PS_SUPPKEY,PS_SUPPLYCOST;PS_PARTKEY:A7,PS_SUPPKEY:A8)

10

=A9.group(PS_PARTKEY).conj(~.minp@a(PS_SUPPLYCOST))

11

=A10.new(PS_SUPPKEY.S_ACCTBAL,PS_SUPPKEY.S_NAME,PS_SUPPKEY.S_NATIONKEY.N_NAME,PS_PARTKEY.P_PARTKEY,PS_PARTKEY.P_MFGR,PS_SUPPKEY.S_ADDRESS,PS_SUPPKEY.S_PHONE,PS_SUPPKEY.S_COMMENT)

12

=A11.total(top(100;-S_ACCTBAL,N_NAME,S_NAME,P_PARTKEY))

13

=interval@ms(A1,now())

In the code, A5-A9 defines the cursor of the above-mentioned view V.

In A10, partsupp table is ordered by PS_PARTKEY field. We can use the order-based grouping function group()to get the result set in the form of cursor, reducing memory space and avoiding hash value computation & comparison. minp() function gets the record containing the smallest value (not the smallest value itself) from each group.

Besides, A7 uses the technique of writing filtering condition in the cursor, which is explained in essay Q1. A8 and A9 combine this technique with switch@i method (the second group of parameters) to perform foreign-key-based matching at cursor creation and directly filter away the corresponding record of non-matching key value, without retrieving the other fields and generating the record. If the foreign key value is matched, convert the join field value into a pointer.

Further optimization

1. Data conversion

1.1 Dimension table preloading

We can load a dimension table that is not so large into memory in advance for use in complex queries.

1.2 Converting string to integer

The optimization method is explained in essay Q1. In this example, P_TYPE field also undergoes the string-to-integer conversion.

1.3 Numberizing the dimension table’s primary key

If a dimension table’s primary key values are natural numbers starting from 1 (which means they are row numbers the records correspond to), we can use key values to locate dimension table records directly according to their row numbers, without computing and comparing HASH values. This can speed up the dimension table join operation. In addition, locating records directly through ordinal numbers does not require creating index, which considerably reduces memory space usage.

2. Code for data conversion

2.1 Conversion on regiontable


A

1

=file("region.btx").import@b()

2

=A1.run(R_REGIONKEY=#)

3

=file("region_2.btx").export@b(A2)

A2 Convert values of primary key R_REGIONKEY into row numbers.

A3 Store data table after conversion in region_2.btx.

2.2 Conversion on nationtable


A

1

=file("nation.btx").import@b()

2

=file("region.btx").import@b()

3

=A1.run(N_NATIONKEY=#, N_REGIONKEY=A2.pselect(R_REGIONKEY==N_REGIONKEY))

4

=file("nation_2.btx").export@b(A3)

A3 Set values of primary key N_NATIONKEY as row numbers, and values of N_REGIONKEY as corresponding row numbers in region table.

2.3 Conversion on parttable


A

1

=file("part.ctx").open().cursor().fetch()

2

=A1.id(P_TYPE).sort()

3

=file("p_type.txt").export(A2)

4

=A1.run(P_PARTKEY=#, P_TYPE=A2.pos@b(P_TYPE))

5

=file("part_2.ctx").create(#P_PARTKEY, P_NAME,P_MFGR, P_BRAND, P_TYPE, P_SIZE, P_CONTAINER, P_RETAILPRICE, P_COMMENT)

6

>A5.append(A4.cursor())

A2 Use id() function to get a list table of distinct P_TYPE field values, and sort them.

A3 Store A2’s sorted list table in p_type.txt.

A4 Set values of primary key P_PARTKEY as row numbers, and change P_TYPE values into corresponding ordinal numbers in the list table of distinct values.

2.4 Conversion on suppliertable


A

1

=file("supplier.ctx").open().cursor().fetch()

2

=file("nation.btx").import@b()

3

=A1.run(S_SUPPKEY=#,S_NATIONKEY=A2.pselect(N_NATIONKEY==S_NATIONKEY))

4

=file("supplier_2.ctx").create(#S_SUPPKEY, S_NAME, S_ADDRESS, S_NATIONKEY, S_PHONE, S_ACCTBAL, S_COMMENT)

5

>A4.append(A3.cursor())

2.5 Conversion on partsupptable


A

1

=file("partsupp.ctx").open().cursor()

2

=file("part.ctx").open().import(P_PARTKEY).keys@im(P_PARTKEY)

3

=file("supplier.ctx").open().import(S_SUPPKEY).keys@im(S_SUPPKEY)

4

=A1.run(PS_PARTKEY=A2.pfind(PS_PARTKEY), PS_SUPPKEY=A3.pfind(PS_SUPPKEY))

5

=file("partsupp_2.ctx”).create@p(#PS_PARTKEY, #PS_SUPPKEY,PS_AVAILQTY, PS_SUPPLYCOST,PS_COMMENT)

6

>A5.append(A4)

A1 As partsupp table is large, we load its data in the way of cursor.

A2 Retrieve P_PARTKEY column from part table, and set it as the primary key and create index on it.

A4 Search A2 for row number of the record whose primary key value is PS_PARTKEY and assigns it to PS_PARTKEY field. Same handling for PS_SUPPKEY field.

3. Code after data conversion

The dimension tables and the value list of string fields need to be preloaded. Below is preloading code:


A

1

>env(region, file(“region_2.btx”).import@b())

2

>env(nation, file("nation_2.btx").import@b())

3

>env(part, file("part_2.ctx").open().import())

4

>env(supplier, file("supplier_2.ctx").open().import())

5

>env(p_type,file("p_type.txt").import@si())

Before performing the query, we need to first run the preloading code to load the small dimension tables into memory.

Computing code:


A

1

=now()

2

>size=25

3

>type="COPPER"

4

>name="ASIA"

5

=nation.(region(N_REGIONKEY).R_NAME==name)

6

=part.@m(if(P_SIZE==size && pos@t(p_type(P_TYPE),type),~,null))

7

=supplier.@m(if(A5(S_NATIONKEY),~,null))

8

=file("partsupp_2.ctx").open().cursor@m(PS_PARTKEY,PS_SUPPKEY,PS_SUPPLYCOST;PS_PARTKEY:A6:#, PS_SUPPKEY:A7:#)

9

=A8.group(PS_PARTKEY).conj(~.minp@a(PS_SUPPLYCOST))

10

=A9.new@m(PS_SUPPKEY.S_ACCTBAL,PS_SUPPKEY.S_NAME,nation(PS_SUPPKEY.S_NATIONKEY).N_NAME, PS_PARTKEY.P_PARTKEY,p_mfgr(PS_PARTKEY.P_MFGR):P_MFGR,PS_SUPPKEY.S_ADDRESS, PS_SUPPKEY.S_PHONE, PS_SUPPKEY.S_COMMENT)

11

=A10.total(top(100;-S_ACCTBAL,N_NAME,S_NAME,P_PARTKEY))

12

=interval@ms(A1,now())

Use nation.(region(N_REGIONKEY).R_NAME==name)to get a sequence having same length as the number of records and consisting of values of true and false. The sequence is an alignment sequence. As supplier table’s S_NATIONKEY has been numberized and its values correspond to row numbers of nation table records, A7 uses A5(S_NATIONKEY) to judge whether the current row in the supplier table meets the filtering condition. Both A6 and A7 are alignment sequences, which take the original record as a member when it meets the filtering condition and use null if the original record does not meet the condition.

In both A6 and A7, @m enables multithreaded processing.

In A8, PS_PARTKEY:A6:# means performing the association directly by associating the current partsupp row’s PS_PARTKEY value with the corresponding A6’s row number, avoiding hash computation and comparison. If the current row’s corresponding member in A6 is a null, do not retrieve it. In the first case, convert PS_PARTKEY into a pointer pointing to A6’s associated row. Similar operation for PS_SUPPKEY:A7:#.

Using enterprise edition’s column-wises computation

1. Original data


A

1

=now()

2

>size=25

3

>type="COPPER"

4

>name="ASIA"

5

=file("region.btx").import@b().select@1(R_NAME==name).R_REGIONKEY

6

=file("nation.btx").import@b().select(N_REGIONKEY==A5).derive@o().keys(N_NATIONKEY)

7

=file("part.ctx").open().cursor@mv(P_PARTKEY,P_MFGR;P_SIZE==size && pos@t(P_TYPE,type)).fetch().keys@im(P_PARTKEY)

8

=file("supplier.ctx").open().cursor@mv(S_SUPPKEY,S_NAME,S_ADDRESS,S_NATIONKEY,S_PHONE,S_ACCTBAL,S_COMMENT;S_NATIONKEY:A6).fetch().keys@im(S_SUPPKEY)

9

=file("partsupp.ctx").open().cursor@mv(PS_PARTKEY,PS_SUPPKEY,PS_SUPPLYCOST;PS_PARTKEY:A7,PS_SUPPKEY:A8)

10

=A9.group(PS_PARTKEY).conj(~.minp@a(PS_SUPPLYCOST))

11

=A10.new(PS_SUPPKEY.S_ACCTBAL,PS_SUPPKEY.S_NAME,PS_SUPPKEY.S_NATIONKEY.N_NAME,PS_PARTKEY.P_PARTKEY,PS_PARTKEY.P_MFGR,PS_SUPPKEY.S_ADDRESS,PS_SUPPKEY.S_PHONE,PS_SUPPKEY.S_COMMENT)

12

=A11.total(top(100;-S_ACCTBAL,N_NAME,S_NAME,P_PARTKEY))

13

=interval@ms(A1,now())

The cursor() function in A7, A8 and A9 works with @v option to enable retrieving data using column-wise cursor.

2. Optimized data

The dimension tables and the value list of string fields need to be preloaded. Below is preloading code:


A

1

>env(region, file(“region_2.btx”).import@bv())

2

>env(nation, file("nation_2.btx").import@bv())

3

>env(part, file("part_2.ctx").open().import@v())

4

>env(supplier, file("supplier_2.ctx").open().import@v())

5

>env(p_type,file("p_type.txt").import@si())

The @v option is needed to preload each dimension table as a column-wise table sequence.

Before performing the query, we need to first run the preloading code to load the small dimension tables into memory.

Computing code:


A

1

=now()

2

>size=25

3

>type="COPPER"

4

>name="ASIA"

5

=nation.(region(N_REGIONKEY).R_NAME==name)

6

=part.@m(P_SIZE==size && pos@t(p_type(P_TYPE),type))

7

=supplier.(A5(S_NATIONKEY))

8

=file("partsupp_2.ctx").open().cursor@mv(PS_PARTKEY,PS_SUPPKEY,PS_SUPPLYCOST;A6(PS_PARTKEY),A7(PS_SUPPKEY))

9

=A8.group(PS_PARTKEY).conj(~.minp@a(PS_SUPPLYCOST))

10

=A9.new(supplier(PS_SUPPKEY):s,part(PS_PARTKEY):p).new(s.S_ACCTBAL,s.S_NAME,nation(s.S_NATIONKEY).N_NAME, p.P_PARTKEY,p_mfgr(p.P_MFGR):P_MFGR,s.S_ADDRESS, s.S_PHONE, s.S_COMMENT)

11

=A10.total(top(100;-S_ACCTBAL,N_NAME,S_NAME,P_PARTKEY))

12

=interval@ms(A1,now())

Why does A6 uses @m option while A7 does not? Computing T.@m(x) expression on a column-wise table sequence will make multiple copies of T (the number is that of the parallel threads). But, if expression x is simple, the multithreaded computation could have lower efficiency than the single-threaded one. In this example, A6 has a complex expression while A7’s expression is simple.

For column-wise computations, it is best not to use switch action. A8 uses A6(PS_PARTKEY) to filter data only, while with the original data table, the code uses PS_PARTKEY:A6:# to perform association and filtering at the same time. Here PS_PARTKEY is numberized, we can directly get target record from part table according to row number without establishing the association. This makes better use of the advantages of column-wise computation.

A10 first uses PS_SUPPKEY to get corresponding supplier table record and generate s column and PS_PARTKEY to get corresponding part table record and generate p column. Then it uses new() function to find the target column data to be displayed through s column and p column.

Test result

Unit:Second


Regular

Column-wise

Before optimization

2.2

1.4

After optimization

1.7

1.0