Performance Optimization: Group by the Second Sorting Field

 

I. Problem introduction & solution

  When data set T is already ordered by field a and field b, b values that correspond to same a value are ordered. So if we need to sort or group T by b, this is the case that the sorting field/grouping field is ordered in each group by the first field.

  A fast sorting algorithm sorts records group by group recursively and then merges the results. If a data set is already ordered by the second field, the algorithm can make the sorting really fast. After sorting a data set by the second sorting field, we can optimize grouping using the technique explained in Performance Optimization: Order-based Grouping.

  The following tests the SPL grouping technique by second-sorting-field and compares it with Oracle hashing technique.

 

II. Test environment & computing scenario

  The test computer has two Intel2670 16-core processors of 2.6 GHz, a 64GB RAM and an SSD, where a 16-core virtual machine with 8GB RAM is running.

  We create data table sales on the virtual machine. It has 4 fields – orderdate, area (string type), salesman (string type) and amount (real number) – and one billion records. We load it to Oracle and generate a esProc SPL composite table to do the tests.

  The data table is ordered by orderdate, area and salesman in ascending order. The task is to calculate sales amount for every salesperson in each area. This requires grouping by area and salesman and there are one million groups in the result set. Since it takes Oracle very long to output such a big result set, we filter the result set to output orders whose total amounts are less than 471000 CNY. The number is 11 and it takes no time to output them.

 

III. Tests

1. With Oracle

  Test SQL query:

    select * from (
        select /*+ parallel(n) */
            area, salesman, sum(amount) as amount
        from sales
        group by area, salesman
    ) where amount<471000;

   /*+ parallel(n) */ is for parallel test, where n is the number of parallel tasks.

2. SPL grouping by the second sorting field

  Test SPL script:


A

1

=now()

2

=file("/home/ctx/sales.ctx").create().cursor@m(area,salesman,amount;;1)

3

=A2.groups@h(area,salesman;sum(amount):amount).select(amount<471000)

4

=interval@s(A1,now())

  groups function works with @h option to imply that the grouping field is ordered in each group and enable a fast sorting algorithm that sorts records by the grouping field first and then the use of order-based grouping technique.

  The SPL grouping by the second sorting field is performed all in memory. This requires that the memory be able to hold the grouping result set and n result set when multithreaded processing is used (n is the number of parallel threads).

3. SPL HASH grouping

  To use the hash grouping in SPL, just remove @h option from groups function.

4. Test results & explanation

  Test results (Unit: second):

Number of parallel tasks

1

2

4

Oracle

387

195

104

SPL (HASH)

405

208

121

SPL (Group by second sorting field)

252

142

83

  The SPL algorithm is over 50% faster than hash grouping in both SPL and Oracle. It’s normal that the Java-based SPL is slightly slower than C-based Oracle in performing hash grouping (all columns are involved in the tests and SPL’s column-wise storage shows no strength).