Performance Optimization: When Data Set Is Already Ordered by First Sorting Field
I. Problem introduction & solution
You must have encountered such a scenario when sorting data sets. Suppose there are two fields a and b in data set T. T is already ordered by a but not by b. To sort T by both a and b is what we call a sorting where a data set is already ordered by the first sorting field (a is ordered). We have an optimization technique to speed up this type of sorting. Get a batch of records having same a value and sort them by field b; then get next batch of records with same a value and do the sorting by b; and so on. Repeat the retrieval and sort until all records in T are sorted. You don’t need to sort all records in T at once but only a group of them each time. The memory only needs to be able to hold a small group records, which requires much less resource.
It’s a pity that SQL doesn’t support this algorithm. The language always requires sorting all records. SPL offers good support of it. The following is our tests and comparison with Oracle SQL.
II. Test environment & computing scenario
The test computer has two Intel2670 16core processors of 2.6 GHz, a 64GB RAM and an SSD, where a 16core virtual machine with 8GB RAM is running.
We create data table salesman1 on the virtual machine. It has two fields – area (string type) and salesman (string type) and 4 hundred million records ordered by area in ascending order. area has 2000 unique values. Each area value corresponds to twenty thousand salesman values. To do the tests, we load the table to Oracle and generate an esProc SPL composite table from it.
Then we create a samestructure data table salesman2 for doing big data tests. This table has two billion records and 4000 unique values in area field where each value corresponds to 50 thousand salesman values.
The test computing task is to sort data table by both area and salesman.
III. Small data tests
1. With Oracle
Test SQL query:
select area, salesman from salesman1 order by area, salesman
This is the query for sorting. But as it takes extremely long to output all the sorting result, we only output the middle record, whose row number is 2 hundred million, for the convenience of measuring sorting time only. Then the SQL query is as follows:
select area, salesman from (
select area, salesman, rownum rn from (
select area, salesman from salesman1 order by area, salesman
)
) where rn=200000000;
The query is only for avoiding a largescale database sorting and excluding output time. We won’t write the query this way in realworld businesses.
2. In SPL
Test SPL script:
A 

1 
=now() 
2 
=file("/home/ctx/salesman1.ctx").create().cursor(area,salesman) 
3 
=A2.group@qs(area;salesman) 
4 
=A3.skip(199999999) 
5 
=A3.fetch(1) 
6 
=interval@s(A1,now()) 
@s option in group@qs enables a pure sorting over the data set without grouping it; @q option means the data set is already ordered by the sorting expression before semicolon and enables a special sorting algorithm that sorts only by the expression after semicolon (Here is salesman).
IV. Big data tests
1. With Oracle
Test SQL query:
select area, salesman from (
select area, salesman, rownum rn from (
select area, salesman from salesman2 order by area, salesman
)
) where rn=1000000000;
The query outputs the row numbered one billion.
2. In SPL
Test SPL script:
A 

1 
=now() 
2 
=file("/home/ctx/salesman2.ctx").create().cursor(area,salesman) 
3 
=A2.group@qs(area;salesman) 
4 
=A3.skip(999999999) 
5 
=A3.fetch(1) 
6 
=interval@s(A1,now()) 
V. Test results & explanation
Test results (Unit: second):
Data size 
4 hundred million records 
2 billion records 
Oracle 
326 
2556 
SPL 
186 
1266 
The SPL sorting algorithm that only sorts records by nonordered sorting field is 40% faster than the Oracle’s fullscale sorting when the data size is 4 hundred million and 50% faster with a 2 billion data size. The larger the data size is, the more effective the SPL algorithm is in enhancing performance.