Performance Optimization Exercises Using TPC-H – Q1

 

Ⅰ SQL code and analysis

Below is the SQL query statement:

select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    lineitem
where
    l_shipdate <= date '1995-12-01' - interval '90' day(3)
group by
    l_returnflag,
    l_linestatus
order by
    l_returnflag,
    l_linestatus;

This is an ordinary, small-result-set grouping & aggregate computation.

Ⅱ SPL solution

SPL uses groups() function to achieve an ordinary, small-result-set grouping & aggregate computation:


A

1

=now()

2

1995-12-01

3

=A2-90

4

=file("lineitem.ctx").open().cursor@m(L_SHIPDATE,L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS;L_SHIPDATE<=A3)

5

=A4.groups(L_RETURNFLAG, L_LINESTATUS; sum(L_QUANTITY):sum_qty, sum(L_EXTENDEDPRICE):sum_base_price, sum(dp=L_EXTENDEDPRICE*(1-L_DISCOUNT)):sum_disc_price, sum(dp*L_TAX):sum_charge, avg(L_QUANTITY):avg_qty, avg(L_EXTENDEDPRICE):avg_price, avg(L_DISCOUNT):avg_disc, count(1):count_order)

6

=A5.run(sum_charge+=sum_disc_price)

7

=interval@ms(A1,now())

A4 uses @m option with cursor() function to generate a multicursor and perform parallel computation. Writing filtering condition in the cursor can avoid generating records that do not meet the condition.

In A5, the discount price is used twice, and it is computed first to assign variable dp to avoid a repeated computation.

After grouping, A6 performs run operation to add charge to the price so that there will be less additions (the number of records after grouping is much less than that in the original table).

Ⅲ Further optimization

1. Optimization method

1.1 Converting date to integer

Date data generally has long format. It takes long to compute and compare them. In SPL, we use days@ofunction to convert date to small integer in order to improve performance. In our example, L_SHIPDATE field contains date data, the optimization method is to convert them to small integers.

1.2 Converting string to integer

It also takes longer to compute strings than to compute integers. When there are a lot of rows in a data table and a small number of distinct values in its certain string field, we can generate a list table using these distinct values and replace the string field values with corresponding ordinal numbers in the list table. The conversion to an integer field can increase the computing efficiency. In this example, both grouping fields L_RETURNFLAG and L_LINESTATUS are string fields. They have 3 and 2 distinct values respectively, and can be transformed to integer fields.

2. Code for data conversion


A

1

=file("lineitem.ctx").open().cursor(L_RETURNFLAG, L_LINESTATUS)

2

=A1.id(L_RETURNFLAG, L_LINESTATUS).(~.sort())

3

=file("l_returnflag.txt").export(A2(1))

4

=file("l_linestatus.txt").export(A2(2))

5

=file("lineitem.ctx").open().cursor().run(L_RETURNFLAG =A2(1).pos@b(L_RETURNFLAG), L_LINESTATUS =A2(2).pos@b(L_LINESTATUS), L_SHIPDATE=days@o(L_SHIPDATE))

6

=file("lineitem_1.ctx").create@p(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE,L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT)

7

>A6.append(A5)

A2 Generate a value list table for the two string fields and sort it.

A5 Use pos@b() function to get corresponding ordinal number for each string, and use days@o function to convert L_SHIPDATE field values to small integers.

3. Code after data conversion


A

1

>l_linestatus=file("l_linestatus.txt").import@si()

2

>l_returnflag=file("l_returnflag.txt").import@si()

3

=now()

4

=l_linestatus.len()+1

5

1995-12-01

6

=days@o(A5-90)

7

=file("lineitem_1.ctx").open().cursor@m(L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS;L_SHIPDATE<=A6)

8

=A7.groups(L_RETURNFLAG*A4+L_LINESTATUS:gk; sum(L_QUANTITY):sum_qty, sum(L_EXTENDEDPRICE):sum_base_price, sum(dp=L_EXTENDEDPRICE*(1-L_DISCOUNT)):sum_disc_price, sum(dp*L_TAX):sum_charge,sum(L_DISCOUNT):sum_disc, count(1):count_order)

9

=A8.new(l_returnflag(gk\A4):L_RETURNFLAG,l_linestatus(gk%A4):L_LINESTATUS,sum_qty,sum_base_price,sum_disc_price,sum_charge+sum_disc_price:sum_charge,sum_qty/count_order:avg_qty,sum_base_price/count_order:avg_price,sum_disc/count_order:avg_disc,count_order)

10

=interval@ms(A3,now())

A1/A2 Load data from L_LINESTATUSfield and L_ RETURNFLAG field as a list of string values.

A6 Convert date in the filtering condition to a small integer.

A8 In groups() function, compute a new grouping key value gk using values of L_ RETURNFLAGand L_LINESTATUS, converting the grouping operation by two fields into one by gk field. This can speed up the grouping action. This grouping & aggregation does not compute avg_qty,avg_priceand avg_disc. Instead, these aggregates will be computed in A9 through related sum and count values. This way, the grouping operation becomes not so computation-intensive.

A9 Restore values of L_ RETURNFLAGand L_LINESTATUS through gk, and convert them into corresponding string values in the list table.

Ⅳ Using enterprise edition’s column-wises computation

1. Original data


A

1

=now()

2

1995-12-01

3

=A2-90

4

=file("lineitem.ctx").open().cursor@mv(L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS;L_SHIPDATE<=A3)

5

=A4.derive@o(L_EXTENDEDPRICE*(1-L_DISCOUNT):dp)

6

=A5.groups(L_RETURNFLAG, L_LINESTATUS; sum(L_QUANTITY):sum_qty, sum(L_EXTENDEDPRICE):sum_base_price,sum(dp):sum_disc_price, sum(dp*L_TAX):sum_charge, avg(L_QUANTITY):avg_qty, avg(L_EXTENDEDPRICE):avg_price, avg(L_DISCOUNT):avg_disc, count(1):count_order)

7

=A6.new(L_RETURNFLAG,L_LINESTATUS,sum_qty,sum_base_price,sum_disc_price,sum_disc_price+sum_charge:sum_charge,avg_qty,avg_price,avg_disc,count_order)

8

=interval@ms(A1,now())

A4 @v option enables using a column-wise cursor.

In a column-wise cursor, generating new variables should be avoided during the computation. Recommend using the derive method to change a new variable to a new column. As in A5, @o option enables adding a new column to the original table sequence instead of generating a new one.

A7 does not use run() function because the function is only fit for row-wise table sequences. We should avoid using it ina column-wise computation.

2. Optimized data


A

1

>l_linestatus=file("l_linestatus.txt").import@si()

2

>l_returnflag=file("l_returnflag.txt").import@si()

3

=now()

4

=l_linestatus.len()+1

5

1995-12-01

6

=days@o(A5-90)

7

=file("lineitem_1.ctx").open().cursor@mv(L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS;L_SHIPDATE<=A6)

8

=A7.derive@o(L_EXTENDEDPRICE*(1-L_DISCOUNT):dp,L_RETURNFLAG*A4+L_LINESTATUS:gk)

9

=A8.groups(gk;sum(L_QUANTITY):sum_qty, sum(L_EXTENDEDPRICE):sum_base_price,sum(dp):sum_disc_price,sum(dp*L_TAX):sum_charge, sum(L_DISCOUNT):sum_disc, count(1):count_order)

10

=A9.new(l_returnflag(gk\A4):L_RETURNFLAG,l_linestatus(gk%A4):L_LINESTATUS,sum_qty,sum_base_price,sum_disc_price,sum_disc_price+sum_charge:sum_charge,sum_qty/count_order:avg_qty,sum_base_price/count_order:avg_price,sum_disc/count_order:avg_disc,count_order)

11

=interval@ms(A3,now())

A8 creates two new columns dp and gk at the same time.

Ⅴ Test result

Unit:Second


Regular

Column-wise

Before optimization

22.2

11.1

After optimization

18.6

7.3