Performance Optimization Skill: Second-half Ordered Grouping

 

I. Problem introduction & solution

  First of all, what is second-half ordered grouping? When data set T is already ordered by field a” and field band we want to sort or group the table T by field b, in this case, in the segment with the same avalue, field bis always ordered. So this is the scenario where the sorting field/grouping field is ordered in each segment grouped by the first-half field, and we name it as second-half ordered grouping. 

  As we all know, the quick-sort algorithm is to segment data then sort and merge recursively. For data set that is already ordered by the second-half fields, the execution speed of quick-sort algorithm can be very fast already. So if we sort the data set T by field busing the quick-sort algorithm, then we can optimize the grouping with the technique introduced in Performance Optimization Skill: Ordered Grouping.

  SPL provides such an algorithm of second-half ordered grouping. In the following part, well test it and compare it with the hashing grouping algorithm in Oracle.

 

II. Test environment & computing scenario

  The test computer has two Intel2670 CPUs, 2.6G frequency, 16 cores in total, 64G memory and an SSD hard disk, where a 16-core virtual machine with 8G memory is set for testing.

  On the virtual machine, we create the data table saleswhich consists of 4 fields - orderdate, area (strings), salesman (strings) and amount (real numbers) and 1 billion rows of records. We import the data table into the Oracle database, and use it to generate the esProc SPL composite table for testing.

  The data table is sorted by orderdate, area and salesman in ascending order. The task goal is to query the sales amount of every salesman in each area, which requires grouping by areaand salesmanand there are 1 million groups in the result set. Since it takes Oracle a lot of time to output such a big result set, we need to filter the result set again to output only the order whose total amount is less than CNY 471,000. And there are only 11 results left, so the output hardly takes up any time.

 

III. Tests

1. Oracle

  SQL query statement:

    select * from (

        select /*+ parallel(n) */

            area, salesman, sum(amount) as amount

        from sales

        group by area, salesman

    ) where amount<471000;

   /*+ parallel(n) */ is for parallel processing, where n is the number of parallel tasks.

2. Second-half ordered grouping in SPL

  SPL script:


A

1

=now()

2

=file("/home/ctx/sales.ctx").open().cursor@m(area,salesman,amount;;1)

3

=A2.groups@h(area,salesman;sum(amount):amount).select(amount<471000)

4

=interval@s(A1,now())

  The groups function with @h option indicates the grouping fields are second-half sorting fields (ordered in each segment), which is used to sort the grouping fields first with the quick-sort algorithm in SPL and then optimize the performance with ordered grouping technique.

  It is worth noting that the second-half ordered grouping is all performed in memory, which requires the memory to be able to hold the grouping result set and n result set when multi-thread processing is performed (n is the number of parallel threads).

3. HASH grouping in SPL

  To use the hash grouping in SPL, remove the @h option from the groups function in the previous script. 

4. Test results & explanation

  Test results (Unit: second):

Number of parallel threads

1

2

4

Oracle

387

195

104

SPL (HASH)

405

208

121

SPL (Second-half ordered grouping)

252

142

83

  From the test results, the second-half ordered grouping in SPL is over 50% faster than the hash grouping in both SPL and Oracle, which shows significant improvement in the performance. And its normal that the Java-based SPL ordinary grouping is slightly slower than the C-based Oracle in performing hash grouping (all columns are involved in the tests, so the columnar storage of SPL shows no strength).