Performance Optimization Skill: First-half Ordered Sorting
I. Problem introduction & solution
When performing sorting operation on the data set, the following situation sometimes may occur: when the data set T is already ordered by field “a” but not by field “b”, we want to sort T by fields “a”, “b”, and this is what we call the first-half ordered sorting (“a” is ordered). For this case, we come up with an optimization technique to speed up the sorting: first retrieve a group of records with the same “a” value from T and sort them by field “b”; then get next group of records with the same “a” value in turn and do the sorting by “b”; and continue the above actions until all the records in T are retrieved and sorted. With this method, we don’t need to sort the records in T all at once, instead, only a group of them is retrieved each time, which only requires the memory to be able to hold a small group of records.
However, it’s a pity that SQL doesn’t support this algorithm, and it always requires the full sorting on all records. While SPL is in support of this algorithm, so we’ll test it and compare it with the sorting in Oracle.
II. Test environment & computing scenario
The test computer has two Intel2670 CPUs, 2.6G frequency, 16 cores in total, 64G memory and an SSD hard disk, where a 16-core virtual machine with 8G memory is set for testing.
On the virtual machine, we create data table salesman1 which consists of two fields, area (strings) and salesman (strings), and generate 400 million rows of records sorted by field “area” in ascending order. The field “area” has 2,000 distinct values with each “area” value corresponding to 200,000 “salesman” values. Then we import the data table into Oracle database and use it to generate an esProc SPL composite table for testing.
Then we create another same-structure data table salesman2 for big data tests. This table has 2 billion rows of records and 4,000 distinct values in “area” field where each value corresponds to 500,000 “salesman” values.
All the tests is required to sort the data table by both “area” and “salesman”.
III. Small data tests
1. Oracle
SQL query statement:
select area, salesman from salesman1 order by area, salesman
This simple SQL statement is enough for the sorting. However, it may take extremely long to output all the sorting result, so we rewrite the SQL statement to output only the row in the middle position, i.e., the row with 200 million as its row number, for the purpose of decreasing the output amount. In this case, only the time of sorting process is counted:
select area, salesman from (
select area, salesman, rownum rn from (
select area, salesman from salesman1 order by area, salesman
)
) where rn=200000000;
In addition, the no-business significant query is only written to force database to perform the full sorting and avoid the output time.
2. SPL
SPL script:
A |
|
1 |
=now() |
2 |
=file("/home/ctx/salesman1.ctx").open().cursor(area,salesman) |
3 |
=A2.group@qs(area;salesman) |
4 |
=A3.skip(199999999) |
5 |
=A3.fetch(1) |
6 |
=interval@s(A1,now()) |
The @s option in group@qs is used to sort the data set only rather than grouping it; the @q option means the data set is already ordered by the sorting expression before semicolon (“area”) and enables the first-half ordered sorting with the expression after semicolon (“salesman”).
IV. Big data tests
1. Oracle
SQL query statement:
select area, salesman from (
select area, salesman, rownum rn from (
select area, salesman from salesman2 order by area, salesman
)
) where rn=1000000000;
The query outputs the row with 1 billion as its row number.
2. SPL
SPL script:
A |
|
1 |
=now() |
2 |
=file("/home/ctx/salesman2.ctx").open().cursor(area,salesman) |
3 |
=A2.group@qs(area;salesman) |
4 |
=A3.skip(999999999) |
5 |
=A3.fetch(1) |
6 |
=interval@s(A1,now()) |
V. Test results & explanation
Test results (Unit: second):
Data size |
400 million rows of records |
2 billion rows of records |
Oracle |
326 |
2556 |
SPL |
186 |
1266 |
According to the test results, the execution time of the SPL first-half ordered sorting algorithm is 60% of that of full sorting in Oracle when there are 400 million rows of data and only 50% of that with 2 billion rows of data, which shows better performance. In conclusion, the larger the data size is, the more effective the first-half ordered sorting algorithm is in enhancing the performance.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL