Performance optimization case course: TPCH-Q1

 

select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
     lineitem
where
     l_shipdate <= date '1995-12-01' - interval '90' day(3)
group by
     l_returnflag,
     l_linestatus
order by
     l_returnflag,
     l_linestatus;

This calculation does not involve inter-table associations and only involves conditional filtering, grouping and aggregation on the LINEITEM table and then sorting result.

1. Bin file

LINEITEM is a large table and needs to be stored on external storage. We first try to store it as a row-based bin file. Code for data conversion:


A

1

=file("lineitem.tbl").cursor(;,"|").new(_1:L_ORDERKEY,_4:L_LINENUMBER,_2:L_PARTKEY,_3:L_SUPPKEY,_5:L_QUANTITY,_6:L_EXTENDEDPRICE,_7:L_DISCOUNT,_8:L_TAX,_9:L_RETURNFLAG,_10:L_LINESTATUS,_11:L_SHIPDATE,_12:L_COMMITDATE,_13:L_RECEIPTDATE,_14:L_SHIPINSTRUCT, _15:L_SHIPMODE, _16:L_COMMENT)

2

=file("lineitem.btx").export@b(A1)

The fields L_RETURNFLAG and L_LINESTATUS are both strings of length 1, used to record flags and status, and the numbers of their actual values are 2 and 3 respectively.

This is a grouping operation with small result set, and we use groups() function. The calculation code is:


A

1

=now()

2

1995-12-01

3

=A2-90

4

=file("lineitem.btx").cursor@b(L_SHIPDATE,L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS).select(L_SHIPDATE<=A3)

5

=A4.groups(L_RETURNFLAG,L_LINESTATUS;sum(L_QUANTITY):sum_qty,sum(L_EXTENDEDPRICE):sum_base_price, sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):sum_disc_price, sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)*(1+L_TAX)):sum_charge,avg(L_QUANTITY):avg_qty,avg(L_EXTENDEDPRICE):avg_price,avg(L_DISCOUNT):avg_disc, count(1):count_order)

6

=interval@s(A1,now())

Test results:

Test items

Execution time (seconds)

Bin file

108

The purpose of the test here is to compare the speed-up effects of different methods. The specific execution time is not important. What is important is to observe whether the execution time is improved and the extent of improvement after adopting different performance optimization methods.

2. Parallel computing

Parallel computing is to use multiple threads to traverse simultaneously to improve performance:


A

1

=now()

2

1995-12-01

3

=A2-90

4

=file("lineitem.btx").cursor@mb(L_SHIPDATE,L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS).select(L_SHIPDATE<=A3)

5

=A4.groups(L_RETURNFLAG, L_LINESTATUS; sum(L_QUANTITY):sum_qty, sum(L_EXTENDEDPRICE):sum_base_price, sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):sum_disc_price, sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)*(1+L_TAX)):sum_charge, avg(L_QUANTITY):avg_qty, avg(L_EXTENDEDPRICE):avg_price, avg(L_DISCOUNT):avg_disc, count(1):count_order)

6

=interval@s(A1,now())

A4: The cursor() function is added with m option, which means this is a parallel computing.

The test machine is a 4-core virtual machine, and esProc sets the number of parallel threads to 4.

Test results:

Test items

Execution time (seconds)

Bin file

108

Parallel computing on bin file

37

3. Columnar composite table

For such traversal calculation, the performance of adopting columnar composite table should be better than that of adopting row-based bin file. Code for data conversion:


A

1

=file("lineitem.btx").cursor@b()

2

=file("lineitem3.ctx").create(L_ORDERKEY,L_LINENUMBER,L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE,L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT)

3

>A2.append(A1)

Naming the composite table lineitem3 here indicates that it is generated in Section 3, with the main purpose of distinguishing it from the composite tables generated later.

Calculation code:


A

1

=now()

2

1995-12-01

3

=A2-90

4

=file("lineitem3.ctx").open().cursor@m(L_SHIPDATE,L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS;L_SHIPDATE<=A3)

5

=A4.groups(L_RETURNFLAG, L_LINESTATUS; sum(L_QUANTITY):sum_qty, sum(L_EXTENDEDPRICE):sum_base_price, sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):sum_disc_price, sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)*(1+L_TAX)):sum_charge, avg(L_QUANTITY):avg_qty, avg(L_EXTENDEDPRICE):avg_price, avg(L_DISCOUNT):avg_disc, count(1):count_order)

6

=interval@s(A1,now())

A4: Write the filter condition in the composite table cursor, which can avoid generating records that do not meet the condition, and reduce the amount of data read from the hard disk into memory.

Test results:

Test items

Execution time (seconds)

Bin file

108

Parallel computing on bin file

37

Columnar composite table

15

4. Convert enumeration string field to numbers

String operations are more time-consuming than integer operations. When there are a lot of rows in a data table and a small number of distinct values in its certain string field, we can generate a list table using these distinct values and replace the string field values with corresponding sequence numbers in the list table. The conversion to an integer field can increase the computing efficiency.

Both L_RETURNFLAG and L_LINESTATUS are string fields. They have 3 and 2 distinct values respectively, and can be converted to integer fields. The two fields in Q1 happen to be grouping fields. In fact, enumeration string field that is not grouping field can also be converted to numbers.

Code for data conversion:


A

1

=file("lineitem3.ctx").open()

2

=A1.cursor(L_RETURNFLAG, L_LINESTATUS).id(L_RETURNFLAG, L_LINESTATUS).(~.sort())

3

=file("l_returnflag.txt").export(A2(1))

4

=file("l_linestatus.txt").export(A2(2))

5

=A1.cursor().run(L_RETURNFLAG =A2(1).pos@b(L_RETURNFLAG), L_LINESTATUS =A2(2).pos@b(L_LINESTATUS))

6

=file("lineitem4.ctx").create(L_ORDERKEY,L_LINENUMBER,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS,L_SHIPDATE,L_COMMITDATE,L_RECEIPTDATE,L_SHIPINSTRUCT,L_SHIPMODE,L_COMMENT)

7

>A6.append(A5)

A2: Generate a value list table for the two string fields and sort it.

A5: Use pos@b() to get corresponding sequence number of string in the list table.

Calculation code:


A

1

>l_linestatus=file("l_linestatus.txt").import@si()

2

>l_returnflag=file("l_returnflag.txt").import@si()

3

=now()

4

1995-12-01

5

=A4-90

6

=file("lineitem4.ctx").open().cursor@m(L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS;L_SHIPDATE<=A5)

7

=A6.groups(L_RETURNFLAG, L_LINESTATUS; sum(L_QUANTITY):sum_qty, sum(L_EXTENDEDPRICE):sum_base_price, sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):sum_disc_price, sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)*(1+L_TAX)):sum_charge, avg(L_QUANTITY):avg_qty, avg(L_EXTENDEDPRICE):avg_price, avg(L_DISCOUNT):avg_disc, count(1):count_order)

8

=A7.run(L_RETURNFLAG=l_returnflag(L_RETURNFLAG),L_LINESTATUS=l_linestatus(L_LINESTATUS))

9

=interval@s(A3,now())

A8: Convert enumeration string field back to their corresponding string values in the list table.

Test results:

Test items

Execution time (seconds)

Bin file

108

Parallel computing on bin file

37

Columnar composite table

15

Convert enumeration string field to numbers

12

5. Combine grouping fields

During the grouping process, the calculation and comparison of grouping fields account for a large proportion of total computation. Adding even one more grouping field will make calculation and comparison much more complicated, resulting in performance decrease. In other words, the fewer the number of grouping fields, the better the performance.

We find that the value ranges of both the grouping fields L_RETURNFLAG and L_LINESTATUS are very small, we can combine the two fields to compute a new grouping field gk. This way, the grouping operation by two fields is converted into operation by one gk field, which can speed up the grouping action.

Calculation code:


A

1

>l_linestatus=file("l_linestatus.txt").import@si()

2

>l_returnflag=file("l_returnflag.txt").import@si()

3

=now()

4

=l_linestatus.len()+1

5

1995-12-01

6

=A5-90

7

=file("lineitem4.ctx").open().cursor@m(L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS;L_SHIPDATE<=A6)

8

=A7.groups(L_RETURNFLAG*A4+L_LINESTATUS:gk ; sum(L_QUANTITY):sum_qty, sum(L_EXTENDEDPRICE):sum_base_price, sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):sum_disc_price,sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)*(1+L_TAX)):sum_charge, avg(L_QUANTITY):avg_qty, avg(L_EXTENDEDPRICE):avg_price, avg(L_DISCOUNT):avg_disc, count(1):count_order)

9

=A8.run(L_RETURNFLAG =l_returnflag(gk\A4), L_LINESTATUS =l_linestatus(gk%A4))

10

=interval@s(A3,now())

A9: Restore the values of L_RETURNFLAG and L_LINESTATUS through gk, and convert them into corresponding string values in the list table.

Test results:

Test items

Execution time (seconds)

Bin file

108

Parallel computing on bin file

37

Columnar composite table

15

Convert enumeration string field to numbers

12

Combine grouping fields

11

6. Convert date field to numbers

The calculation and comparison of date data is also time-consuming. To improve performance, we use days@o function to convert date to small integer.

L_SHIPDATE field is date data. Now convert its values to small integers:


A

1

=file("lineitem4.ctx").open().cursor().run(L_SHIPDATE=days@o(L_SHIPDATE))

2

=file("lineitem6.ctx").create(L_ORDERKEY,L_LINENUMBER,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS,L_SHIPDATE,L_COMMITDATE,L_RECEIPTDATE,L_SHIPINSTRUCT,L_SHIPMODE,L_COMMENT)

3

>A2.append(A1)

A1: Convert the values of L_SHIPDATE field to small integers using days@o.

Calculation code:


A

1

>l_linestatus=file("l_linestatus.txt").import@si()

2

>l_returnflag=file("l_returnflag.txt").import@si()

3

=now()

4

=l_linestatus.len()+1

5

1995-12-01

6

days@o(A5-90)

7

=file("lineitem6.ctx").open().cursor@m(L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS;L_SHIPDATE<=A6)

8

=A7.groups(L_RETURNFLAG*A4+L_LINESTATUS:gk ; sum(L_QUANTITY):sum_qty, sum(L_EXTENDEDPRICE):sum_base_price, sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):sum_disc_price,sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)*(1+L_TAX)):sum_charge, avg(L_QUANTITY):avg_qty, avg(L_EXTENDEDPRICE):avg_price, avg(L_DISCOUNT):avg_disc, count(1):count_order)

9

=A8.run(L_RETURNFLAG =l_returnflag(gk\A4), L_LINESTATUS =l_linestatus(gk%A4))

10

=interval@s(A3,now())

A6: Use days@o to convert the date filter condition to numbers.

Test results:

Test items

Execution time (seconds)

Bin file

108

Parallel computing on bin file

37

Columnar composite table

15

Convert enumeration string field to numbers

12

Combine grouping fields

11

Convert date field to numbers

11

The optimization effect of this method is not obvious because it only involves date comparison here, without involving actions such as year, month, and day calculations. The time spent on calculating date data and small integers is too small compared to other calculations, so the effect is not reflected.

7. Avoid repeated calculations

The discount price L_EXTENDEDPRICE*(1-L_DISCOUNT) is calculated twice. To avoid repeated calculations, we can first calculate it and then assign it to a variable dp.

Do not compute the average values avg_qty, avg_price, and avg_disc in groups. Instead, calculate them after groups. This way, the grouping operation becomes not so computation-intensive.

Modified code:


A

1

>l_linestatus=file("l_linestatus.txt").import@si()

2

>l_returnflag=file("l_returnflag.txt").import@si()

3

=now()

4

=l_linestatus.len()+1

5

1995-12-01

6

=days@o(A5-90)

7

=file("lineitem6.ctx").open().cursor@m(L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS;L_SHIPDATE<=A6)

8

=A7.groups(L_RETURNFLAG*A4+L_LINESTATUS:gk; sum(L_QUANTITY):sum_qty, sum(L_EXTENDEDPRICE):sum_base_price, sum(dp=L_EXTENDEDPRICE*(1-L_DISCOUNT)):sum_disc_price, sum(dp*L_TAX):sum_charge,sum(L_DISCOUNT):sum_disc, count(1):count_order)

9

=A8.new(l_returnflag(gk\A4):L_RETURNFLAG,l_linestatus(gk%A4):L_LINESTATUS,sum_qty,sum_base_price,sum_disc_price,sum_charge+sum_disc_price:sum_charge,sum_qty/count_order:avg_qty,sum_base_price/count_order:avg_price,sum_disc/count_order:avg_disc,count_order)

10

=interval@ms(A3,now())

A10: Calculate the average values based on the results of sum and count, which can reduce the amount of grouping & aggregation calculation.

Moreover, summing up the price and tax sum_charge+sum_disc_price:sum_charge after grouping reduces the number of addition operations (because the number of records after grouping is much less than that of original table).

Test results:

Test items

Execution time (seconds)

Bin file

108

Parallel computing on bin file

37

Columnar composite table

15

Convert enumeration string field to numbers

12

Combine grouping fields

11

Convert date field to numbers

11

Avoid repeated calculations

10

8. Column-wise computing

Column-wise computing can further improve performance, but there are some points for attention, which will be explained in the following code.

For column-wise computing, there is no need to do special data conversion, so we still use lineitem5.ctx.

Calculation code:


A

1

>l_linestatus=file("l_linestatus.txt").import@si()

2

>l_returnflag=file("l_returnflag.txt").import@si()

3

=now()

4

=l_linestatus.len()+1

5

1995-12-01

6

=days@o(A5-90)

7

=file("lineitem5.ctx").open().cursor@mv(L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS;L_SHIPDATE<=A6)

8

=A7.derive@o(L_EXTENDEDPRICE*(1-L_DISCOUNT):dp,L_RETURNFLAG*A4+L_LINESTATUS:gk)

9

=A8.groups(gk;sum(L_QUANTITY):sum_qty, sum(L_EXTENDEDPRICE):sum_base_price,sum(dp):sum_disc_price,sum(dp*L_TAX):sum_charge, sum(L_DISCOUNT):sum_disc, count(1):count_order)

10

=A9.new(l_returnflag(gk\A4):L_RETURNFLAG,l_linestatus(gk%A4):L_LINESTATUS,sum_qty,sum_base_price,sum_disc_price,sum_disc_price+sum_charge:sum_charge,sum_qty/count_order:avg_qty,sum_base_price/count_order:avg_price,sum_disc/count_order:avg_disc,count_order)

11

=interval@s(A3,now())

The option @v in A7 means using columnar cursor.

If a columnar cursor is used, avoid generating new variables during calculation, and use the derive function to add a new column instead of variable. For example, A8 adds the dp column, and the option @o indicates the new column is added to original table sequence without generating a new table sequence.

Test results:

Test items

Execution time (seconds)

Bin file

108

Parallel computing on bin file

37

Columnar composite table

15

Convert enumeration string field to numbers

12

Combine grouping fields

11

Convert date field to numbers

11

Avoid repeated calculations

10

Column-wise computing

4

9. Ordered storage by date

When the filter condition is a date range and only Q1 query is concerned, then sorting by date is most beneficial to improving the performance of filtering calculation.

Code for data conversion:


A

1

=file("lineitem6.ctx").open().cursor(L_SHIPDATE,L_ORDERKEY,L_LINENUMBER,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS,L_COMMITDATE,L_RECEIPTDATE,L_SHIPINSTRUCT,L_SHIPMODE,L_COMMENT).sortx(L_SHIPDATE;1000000)

2

=file("lineitem9.ctx").create(#L_SHIPDATE,L_ORDERKEY,L_LINENUMBER,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS,L_COMMITDATE,L_RECEIPTDATE,L_SHIPINSTRUCT,L_SHIPMODE,L_COMMENT)

3

>A2.append(A1)

A2: When defining the composite table, prefix the first field L_SHIPDATE with a # sign, which means this is an ordered dimension field.

When calculating, we still use the code of the previous section, just change the name of composite table to lineitem9.ctx and leave the rest unchanged.

SPL engine will automatically utilize the ordered field of composite table to improve performance.

Test results:

Test items

Execution time (seconds)

Bin file

108

Parallel computing on bin file

37

Columnar composite table

15

Convert enumeration string field to numbers

12

Combine grouping fields

11

Convert date field to numbers

11

Avoid repeated calculations

10

Column-wise computing

4

Ordered storage by date

2