Performance Optimization Skill: First-half Ordered Sorting

 

I. Problem introduction & solution

  When performing sorting operation on the data set, the following situation sometimes may occur: when the data set T is already ordered by field abut not by field b, we want to sort T by fields a, “b, and this is what we call the first-half ordered sorting (a is ordered). For this case, we come up with an optimization technique to speed up the sorting: first retrieve a group of records with the same a value from T and sort them by field b; then get next group of records with the same a value in turn and do the sorting by b; and continue the above actions until all the records in T are retrieved and sorted. With this method, we dont need to sort the records in T all at once, instead, only a group of them is retrieved each time, which only requires the memory to be able to hold a small group of records.  

  However, its a pity that SQL doesnt support this algorithm, and it always requires the full sorting on all records. While SPL is in support of this algorithm, so well test it and compare it with the sorting in Oracle.

 

II. Test environment & computing scenario

  The test computer has two Intel2670 CPUs, 2.6G frequency, 16 cores in total, 64G memory and an SSD hard disk, where a 16-core virtual machine with 8G memory is set for testing.

  On the virtual machine, we create data table salesman1 which consists of two fields, area (strings) and salesman (strings), and generate 400 million rows of records sorted by field area in ascending order. The field area has 2,000 distinct values with each areavalue corresponding to 200,000 salesman values. Then we import the data table into Oracle database and use it to generate an esProc SPL composite table for testing.

  Then we create another same-structure data table salesman2 for big data tests. This table has 2 billion rows of records and 4,000 distinct values in area field where each value corresponds to 500,000 salesman values. 

  All the tests is required to sort the data table by both area and salesman.

 

III. Small data tests

1. Oracle

  SQL query statement:

    select area, salesman from salesman1 order by area, salesman

 

  This simple SQL statement is enough for the sorting. However, it may take extremely long to output all the sorting result, so we rewrite the SQL statement to output only the row in the middle position, i.e., the row with 200 million as its row number, for the purpose of decreasing the output amount. In this case, only the time of sorting process is counted:

    select area, salesman from (

        select area, salesman, rownum rn from (

            select area, salesman from salesman1 order by area, salesman

        )

    ) where rn=200000000;

  In addition, the no-business significant query is only written to force database to perform the full sorting and avoid the output time. 

2. SPL

  SPL script:


A

1

=now()

2

=file("/home/ctx/salesman1.ctx").open().cursor(area,salesman)

3

=A2.group@qs(area;salesman)

4

=A3.skip(199999999)

5

=A3.fetch(1)

6

=interval@s(A1,now())

  The @s option in group@qs is used to sort the data set only rather than grouping it; the @q option means the data set is already ordered by the sorting expression before semicolon (area) and enables the first-half ordered sorting with the expression after semicolon (salesman).

 

IV. Big data tests

1. Oracle

  SQL query statement:

    select area, salesman from (

        select area, salesman, rownum rn from (

            select area, salesman from salesman2 order by area, salesman

        )

    ) where rn=1000000000;

  The query outputs the row with 1 billion as its row number. 

2. SPL

  SPL script:


A

1

=now()

2

=file("/home/ctx/salesman2.ctx").open().cursor(area,salesman)

3

=A2.group@qs(area;salesman)

4

=A3.skip(999999999)

5

=A3.fetch(1)

6

=interval@s(A1,now())

 

V. Test results & explanation

  Test results (Unit: second):

Data size

400 million rows of records

2 billion rows of records

Oracle

326

2556

SPL

186

1266

  According to the test results, the execution time of the SPL first-half ordered sorting algorithm is 60% of that of full sorting in Oracle when there are 400 million rows of data and only 50% of that with 2 billion rows of data, which shows better performance. In conclusion, the larger the data size is, the more effective the first-half ordered sorting algorithm is in enhancing the performance.