Performance optimization skill: Order-based Targeted MERGE

 

I. Problem introduction & Solution

  The order-based MERGE algorithm has proved a success in increasing the JOIN performance (See Performance Optimization Tricks: Order-based MERGE). Yet we always want to go faster, by reading in as few records as possible this time.

  Sometimes the join of a main table and its subtable is followed by a filtering. The order-based MERGE deals with this by first handling the filtering and then the join. It retrieves the main table and the subtable and performs filtering over one of them. The filtered table is left with only the targeted records while the other table still contains mismatched records, a lot of them at many occasions, before the MERGE is executed. Those mismatched records will thus be still read in and it will take time to read them. To make the MERGE join faster, we need to devise a scheme to reduce the read actions. We will locate the matching records according to the primary key values of the filtered table and then perform the MERGE join over the abridged versions of the two tables. This is achievable because a SPL composite table stores data in order by the key, according to which we can efficiently locate corresponding records.

  We’ll do tests to compare this order-based targeted MERGE scheme with the order-based MERGE trick to see whether there is performance improvement. Explanations will be made about the resuls.

 

II. Test environment

  There are two test computers. Each has a 16-core, 2.6 GHz processor, a 128GB RAM and an SSD.

  A total of 200G data has been generated according to TPCH benchmark. The main table is called orders and the subtable is lineitem. The two tables are respectively ordered by O_ORDERKEY and L_ORDERKEY in ascending order. Since the data is big and it takes long to handle it with a single-threaded processing, we use a 4-thread parallel processing to do the tests.

 

III. Filtering over the main table

  The filtering condition is O_TOTALPRICE>price. Parameter price is used to test the scheme’s performance with different data sizes after the main table is filtered.

1. Order-based MERGE

  SPL script:


A

1

=now()

2

=file("/home/ctx/orders.ctx").create().cursor@m(O_ORDERKEY,O_ORDERDATE;O_TOTALPRICE>price;4)

3

=file("/home/ctx/lineitem.ctx").create().cursor(L_ORDERKEY,L_EXTENDEDPRICE,L_DISCOUNT;;A2)

4

=joinx(A3:detail,L_ORDERKEY;A2:orders,O_ORDERKEY)

5

=A4.groups(year(orders.O_ORDERDATE):year;

sum(detail.L_EXTENDEDPRICE * (1 - detail.L_DISCOUNT)):revenue)

6

=interval@s(A1,now())

2. Order-based targeted MERGE

  SPL script:


A

1

=now()

2

=file("/home/ctx/orders.ctx").create().cursor@m(O_ORDERKEY,O_ORDERDATE;O_TOTALPRICE>price;4)

3

=file("/home/ctx/lineitem.ctx").create().news(A2,L_ORDERKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_ORDERDATE)

4

=A3.groups(year(O_ORDERDATE):year;

sum(L_EXTENDEDPRICE  * (1   - L_DISCOUNT)):revenue)

5

=interval@s(A1,now())

  Instead of using the cursor()function to generate a subtable cursor and then performing a MERGE join with joinx() function, A3 uses news() function to directly generate a cursor of the join of the main table cursor and the subtable using the order-based targeted MERGE algorithm.

3. Test results & explanations

  With a main table (orders) containing 3 billion records, the test results are as follows:

Record count in a filtered   main table

2 billion

1.4 billion

0.9 billion

0.17 billion

Order-based MERGE (sec)

131

115

94

57

Order-based Targeted MERGE (sec)

105

87

69

35

  Both SPL scripts create a main table cursor in the same way. With the order-based MERGE algorithm, a subtable cursor is created by retrieving all the three fields for each record and then an order-based MERGE join is performed over the two cursors using joinx() function. The order-based targeted MERGE algorithm searches for records in the subtable by matching its join key values to the key values in the filtered main table. A composite table stores data by blocks and its index headers record the high value and low value for each block of key values. If a key value in the main table is greater than the high value of the current block of key values in the sub table, just skip it to move onto the next block to match until a block containing this key value is found. The matching only compares the key values. The other two fields will be read only when the key values match.

  The test results show that the order-based targeted MERGE scheme can considerably increase performance no matter what data size a filtered main table has.

 

IV Filtering over the subtable

  The filtering condition is L_QUANTITY >quantity. Parameter quantity is used to test the scheme’s performance with different data sizes after the subtable is filtered.

1. Order-based MERGE

  SPL script:


A

1

=now()

2

=file("/home/ctx/orders.ctx").create().cursor@m(O_ORDERKEY,O_ORDERDATE;;4)

3

=file("/home/ctx/lineitem.ctx").create().cursor(L_ORDERKEY,L_EXTENDEDPRICE,L_DISCOUNT;L_QUANTITY>quantity;A2)

4

=joinx(A3:detail,L_ORDERKEY;A2:orders,O_ORDERKEY)

5

=A4.groups(year(orders.O_ORDERDATE):year;  sum(detail.L_EXTENDEDPRICE * (1 - detail.L_DISCOUNT)):revenue)

6

=interval@s(A1,now())

2. Order-based targeted MERGE

  SPL script:


A

1

=now()

2

=file("/home/ctx/lineitem.ctx").create().cursor@m(L_ORDERKEY,L_EXTENDEDPRICE,L_DISCOUNT;L_QUANTITY>  quantity;4)

3

=file("/home/ctx/orders.ctx").create().new(A2,O_ORDERKEY,O_ORDERDATE,sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):v)

4

=A3.groups(year(O_ORDERDATE):year;  sum(v):revenue)

5

=interval@s(A1,now())

  3 uses new() function to create cursor for the main table and sum() function to generate a computed field for the main table by grouping the subtable according to the main table’s key values and calculating sum for each group.

3. Test results & explanations

  With a subtable (lineitem) having 12 billion records, the test results are as follows:

Record count in a filtered   subtable

8.4 billion

7.2 billion

3.6 billion

1.2 billion

0.5 billion

Order-based MERGE (sec)

127

115

84

57

36

Order-based targeted MERGE   (sec)

92

87

72

47

31

  Here the difference is that a filtered subtable is joined with the main table using the new() function. The optimization method is the same. The filtered subtable only MERGE joins matching records in the main table rather than all of them by traversal. This cuts down the number the records to be read.

  According to the test results, the fewer the number of records in the filtered subtable, the smaller the decrease rate gets.

 

V Summary

  In a nutshell, the order-based target algorithm can make the main-sub table join query with a filtering condition even faster.