Performance optimization skill: Use Multi-purpose traverse to speed up multiple grouping

 

As we all know, the bottleneck of big data computing performance often lies in external storage (i.e., hard disk) IO because the access performance of external storage is one or two orders of magnitude slower than that of memory. Therefore, it is sometimes more important to reduce the amount of hard disk access than to reduce the amount of CPU calculation when optimizing performance. If the algorithm with less hard disk access can be used, better performance will be obtained for the same task even with the same or slightly more CPU computation.

  Grouping and aggregation require traversing the dataset. The same dataset may be grouped by different dimensions, so it has to be traversed many times in principle and multiple hard disk accesses will be involved for big data. However, if we can calculate the grouping results of multiple dimensions in one traversing, it will reduce a lot of hard disk access.

  Unfortunately, SQL does not support such an operation (returning multiple grouping results in one traversing), it can only traverse many times or depends on whether the database engine can be optimized. While SPL supports the syntax of multi-purpose traverse, which can calculate multiple grouping results at one traversal, thus improving performance.

  Let's do some test to see if Oracle can optimize the calculation of multiple traversals and the effect of multi-purpose traverse on performance in SPL.

 

1. Data preparation and environment

  The SPL script generates a data file with two columns. The first column ID is a random integer of less than 2 billion, and the second column amount is a random real number of no more than 10 million. There are 8 billion rows of data records, and the size of the generated original text file is 169 G. With the data importing tool provided by the database, the file data is imported into Oracle's data table topn, and the SPL composite table file topn.ctx is also generated with the file data.

  The test is performed on an Intel server, 2 Intel 3014 CPUs, 1.7G frequency, 12 cores in total, 64G memory. The database table data and SPL composite table file are both stored on the same SSD hard disk.           

  The data volume is deliberately made larger than the memory to ensure that the operating system can not cache all the data into memory and will definitely read the hard disk in actual calculation.

 

2. Oracle Testing

  The test is divided into three situations: single grouping with single calculation amount, single grouping with double calculation amount, and double grouping with double calculation amount.

a)  Single grouping with single calculation amount

    select /*+ parallel(12) */ mod(id,100) Aid,max(amount) Amax from topn group by mod(id,100)

b)  Single grouping with double calculation amount

    select /*+ parallel(12) */

    mod(id,100)+floor(id/20000000) Aid, max(amount) Amax, min(amount) Amin from topn group by mod(id,100)+floor(id/20000000);

  The calculation expressions are doubled, which equals to twice the amount of calculation.

c)  Double grouping with double calculation amount

    select /*+ parallel(12) */ * from

    (select mod(id,100) Aid,max(amount) Amax from topn group by mod(id,100) ) A

    join

    (select floor(id/20000000) Bid,min(amount) Bmin from topn group by floor(id/20000000) ) B

    on A.Aid=B.Bid;

  The calculation amount here is basically the same as b), but there are two groups. So we will observe if the database will perform traversing twice. The last JOIN operation involves only 100 rows of data with negligible time.

 

3. SPL Testing

  Let's do the test in Oracle again in SPL.

a)  Single grouping with single calculation amount

  Write SPL script to execute the test:


A

1

=now()

2

=file("/home/topn/topn.ctx").open().cursor@m(id,amount)

3

=A2.groups@u(id%100:Aid;max(amount):Amax)

4

=interval@s(A1,now())

A2 defines a multi-cursor to execute in parallel, and the number of parallel is determined by the cursorParallelNum in the configuration file raqsoftConfig.xml.

b)  Single grouping with double calculation amount

  Write SPL script to execute the test:


A

1

=now()

2

=file("/home/topn/topn.ctx").open().cursor@m(id,amount)

3

=A2.groups@u(id%100+id\20000000:Aid;max(amount):Amax,min(amount):Amin)

4

=interval@s(A1,now())

 

c)  Double grouping with double calculation amount

  Write SPL script to execute the test:


A

B

1

=now()


2

=file("/home/topn/topn.ctx").open().cursor@m(id,amount)

3

cursor   A2

=A3.groups@u(id%100:Aid;max(amount):Amax)

4

cursor

=A4.groups@u(id\20000000:Bid;max(amount):Bmax)

5

=A3.join@i(Aid,A4:Bid,Bid,Bmax)

6

=interval@s(A1,now())


  The exclusive multi-purpose traversal syntax of SPL is used here. A cursor is defined in A2, and two sets of calculations for this cursor are defined in A3/B3 and A4/B4, which means that the two results will be calculated simultaneously in one cursor traversing.

 

4. Analysis and conclusions

  The time for the three situations is as follows:

  Test results (time unit: seconds)


single grouping with single calculation   amount

single grouping with double calculation   amount

double grouping with double calculation   amount

Oracle

458

692

878

SPL

336

350

376

  As seen in the test result of Oracle, double grouping with double calculation amount is nearly 200 seconds slower than single grouping with double calculation amount, which cannot be ignored anymore. Since the calculation amount is almost the same for both, the extra time is estimated to be spent in one more traversing. This means that the database will not automatically optimize traversal and the data table will be traversed twice for double grouping. As a result, it will take almost twice as long to do one more grouping.

  However, SPL adopts the mechanism of multi-purpose traversing, and the calculation time of the three tests is very little in difference. It will not traverse once more to do one more grouping and does not slow down much with only some added control logic for multi-purpose traversing.  

Please note that the amountfield type of Oracle is set as decimalwhen preparing data, so the calculation speed is slower; whereas doubletype is used in SPL composite table, which makes the performance much faster. But this test is not a comparison of the computation performance of Oracle and SPL, and these differences do not affect the above conclusions.