Performance optimization skill:TopN

  TopN is a common operation, which is written in SQL like this(Oracle for example):

        select * from ( select * from T order by x desc) where rownum<=N

  This SQL's operation logic, from its statement point of view, needs to do the order by first, and then take out the first N.           

  We know that sorting is a very slow action with high complexity (n*logn). If the amount of data involved is too large to be stored in memory, it also needs to exchange data between memory and external storage, and the performance will decline dramatically.

  In fact, to compute TopN, we can design an algorithm that does not need complete sorting. Just keep a small set of size N, traverse the data set once, save the first N records of the traversed data in this small set. When a new piece of data is traversed, if it is larger than the current Nth record, insert it into the Nth place and discard the current Nth record. If it is smaller than the current Nth record, no operation is needed.           

  In this way, the computational complexity is much lower (n*logN, n is the total number of data items), and generally N is not large and can fit into memory. Regardless of the amount of data, the problem of data exchange between memory and external storage will not be involved.

  However, SQL can not describe the above calculation process, and we can only hope that the database engine can do the optimization by itself. It is easy to describe the above calculation process using SPL, so as to achieve high performance computing.           

  Let's do a test to see if Oracle can do this optimization, that is, to implement TopN with SPL and Oracle respectively, and make the comparison. Because SPL can use optimization algorithm, if Oracle's computing time is similar to that of SPL, then it has been optimized, and if it's far from it, it may have performed complete sorting.

1. Data preparation and environment

  SPL script is used to generate data file. There are two columns of data. The first column ID is less than 2 billion random integers, and the second column amount is no more than 10 million random real numbers. The data record is 8 billion lines, and the original text file size generated is 169 G. Using the data import tool provided by the database, the file data is imported into Oracle's data table topn, and the SPL group table file topn.ctx is generated with the file data.

  Complete the test on an Intel server, 2 Intel 3014 CPUs, main frequency 1.7G, a total of 12 cores, 64G memory. The database table data and SPL group table file are stored on the same SSD hard disk.           

  We deliberately designed the amount of data to be larger than memory, so if sorted, there will be data exchange between memory and external storage, and performance degradation will be very large, which is easy to be observed.

2. Routine TopN

  Take out the top 100 items with the largest amount in the topn table.

a)  Oracle Test

  The SQL statement for query is

    select * from ( 
        select  /*+ parallel(4) */ 
        * from topn order by amount desc 
    ) where rownum<=100;

  Note/*+ parallel(4) */ is Oracle's parallel query grammar, of which four is parallel number.

b)  SPL Test

  Write SPL script to execute test


A


1

=now()

/Recording time

2

=4

/Number of parallels

3

=file("/home/topn/topn.ctx").create()

/Create group table object

4

=A3.cursor@m(id,amount;;A2).groups(;top(100;-amount))


5

=interval@s(A1,now())

/ Calculating time   consumed

  Unlike SQL, SPL treats TopN as an aggregation operation, just like operations such as sum/count, only TopN returns a set, while sum/count returns a single value. But their computational logic is the same, they only need to traverse the original data set once, and do not involve complete sorting.

  groups(;top(100;-amount) in A4 is to do TopN aggregation operation on the whole set and calculate the top 100.

c)  Conclusion and analysis

  The routine TopN test time is shown in the following table:

  Test results (time unit: seconds)

Number   of parallel

1

2

4

8

12

Oracle

1922

952

526

308

256

SPL   group table

2641

1565

729

371

321

  Test shows that Oracle is a little faster than SPL, and SPL does not do a complete sorting, which indicates that Oracle will automatically optimize in this situation.           

  It is also understandable that Oracle is faster than SPL, because Oracle is developed in C++ and the current version of SPL is developed in java. It’s normal that C++ computing is faster than Java computing. Besides, this test reads all two columns of data, and the data is random and disorderly, which is difficult to compress, so the column storage of group table has no advantage.

3. Increasing complexity

  For the most basic TopN, Oracle is smart enough to optimize even if it is written as a sub-query. Let's increase the difficulty of the problem and do TopN in each group after grouping.

  The specific operation design is: grouping according to the last digit value of the ID field, and then querying the first 100 records with the largest amount in each group. ID is an integer, so there are only 10 groups. The calculation amount of grouping itself is not large, but to do TopN for grouped data, the overall calculation complexity is slightly higher. If there is no complete sorting, the total operation time should be more than the case before, but still within the same order of magnitude.

a)  Oracle Test

  The SQL statement for query is

    select * from ( 
        select  /*+ parallel(2) */ 
	    mod(id,10) as gid,amount,
            row_number() over (partition by mod(id,10) order by amount desc) rn 
        from topn
    ) where rn <= 100;

  SQL can not directly write out the operation of taking TopN after grouping. It can only use window function to calculate the row number. There are still words of order by in the grammar.

b)  SPL group table Test

  Write SPL script to execute test


A


1

=now()

/Recording time

2

=4

/ Number of   parallels

3

=file("/home/topn/topn.ctx").create()

/Create group table object

4

=A3.cursor@m(id,amount;;A2).groups@u(id%10:gid;top(100;-amount))


5

=interval@s(A1,now())

/ Calculating   time consumed

  Because SPL regards TopN as aggregate computing, it can be easily put into grouping aggregation, and has almost the same grammar as aggregation.

c)  Conclusion and analysis

  Test results (time unit: seconds)

Number   of parallel

1

2

4

8

12

Oracle

41649

19602

9359

4627

3211

SPL   group table

4380

2127

1007

465

349

 

  After increasing the difficulty, Oracle is more than 10 times slower than the previous simple case, and nearly 10 times slower than SPL to do the same operation. This indicates that Oracle is likely to have done a sort action in this case, and when the situation becomes more complex, Oracle's optimization engine does not work.           

 

  In these two cases, the difference of operation time of SPL is less than two times, basically in the same order of magnitude, which accords with our previous analysis, and the advantages of the algorithm are fully reflected.