Performance Optimization: Group by the Second Sorting Field
I. Problem introduction & solution
When data set T is already ordered by field a and field b, b values that correspond to same a value are ordered. So if we need to sort or group T by b, this is the case that the sorting field/grouping field is ordered in each group by the first field.
A fast sorting algorithm sorts records group by group recursively and then merges the results. If a data set is already ordered by the second field, the algorithm can make the sorting really fast. After sorting a data set by the second sorting field, we can optimize grouping using the technique explained in Performance Optimization: Orderbased Grouping.
The following tests the SPL grouping technique by secondsortingfield and compares it with Oracle hashing technique.
II. Test environment & computing scenario
The test computer has two Intel2670 16core processors of 2.6 GHz, a 64GB RAM and an SSD, where a 16core virtual machine with 8GB RAM is running.
We create data table sales on the virtual machine. It has 4 fields – orderdate, area (string type), salesman (string type) and amount (real number) – and one billion records. We load it to Oracle and generate a esProc SPL composite table to do the tests.
The data table is ordered by orderdate, area and salesman in ascending order. The task is to calculate sales amount for every salesperson in each area. This requires grouping by area and salesman and there are one million groups in the result set. Since it takes Oracle very long to output such a big result set, we filter the result set to output orders whose total amounts are less than 471000 CNY. The number is 11 and it takes no time to output them.
III. Tests
1. With Oracle
Test SQL query:
select * from (
select /*+ parallel(n) */
area, salesman, sum(amount) as amount
from sales
group by area, salesman
) where amount<471000;
/*+ parallel(n) */ is for parallel test, where n is the number of parallel tasks.
2. SPL grouping by the second sorting field
Test SPL script:
A 

1 
=now() 
2 
=file("/home/ctx/sales.ctx").create().cursor@m(area,salesman,amount;;1) 
3 
=A2.groups@h(area,salesman;sum(amount):amount).select(amount<471000) 
4 
=interval@s(A1,now()) 
groups function works with @h option to imply that the grouping field is ordered in each group and enable a fast sorting algorithm that sorts records by the grouping field first and then the use of orderbased grouping technique.
The SPL grouping by the second sorting field is performed all in memory. This requires that the memory be able to hold the grouping result set and n result set when multithreaded processing is used (n is the number of parallel threads).
3. SPL HASH grouping
To use the hash grouping in SPL, just remove @h option from groups function.
4. Test results & explanation
Test results (Unit: second):
Number of parallel tasks 
1 
2 
4 
Oracle 
387 
195 
104 
SPL (HASH) 
405 
208 
121 
SPL (Group by second sorting field) 
252 
142 
83 
The SPL algorithm is over 50% faster than hash grouping in both SPL and Oracle. It’s normal that the Javabased SPL is slightly slower than Cbased Oracle in performing hash grouping (all columns are involved in the tests and SPL’s columnwise storage shows no strength).