Performance optimization skill: Reuse of traversing result to speed up multiple grouping
We know that the bottleneck of big data computing performance is often on external storage (i.e. hard disk) IO, because external storage access performance is one or two orders of magnitude lower than memory. Therefore, in performance optimization, it is sometimes more important to reduce the amount of access to the hard disk than to reduce the amount of CPU calculation. For the same task, if the algorithm with less hard disk access can be used, even if the CPU computation is constant or even slightly more, better performance will be obtained.
Grouping and aggregation requires traversing the dataset. The same dataset may be grouped according to different dimensions, so in principle it has to be traversed many times, and multiple hard disk accesses will be involved in big data. However, if we can calculate the grouping results of multiple dimensions in one traversal process, it will reduce a lot of hard disk access.
Unfortunately, SQL can't write such an operation (returning multiple grouped results in one traversal), it can only traverse many times, or hope that the database engine can be optimized. SPL supports this traversal reuse grammar, which can calculate multiple grouping results at one traversal, thus improving performance.
Let's do a test to see if Oracle can optimize the calculation of multiple traversals and the effect of traversal reuse algorithm on performance in SPL.
1. Data preparation and environment
The SPL script generates a data file with two columns. The first column ID is a random integer of less than 2 billion, and the second column amount is a random real number of no more than 10 million. The data record is 8 billion lines, and the original text file size generated is 169 G. Using the data import tool provided by the database, the file data is imported into Oracle's data table topn, and the SPL group table file topn.ctx is generated with the file data.
Complete the test on an Intel server, 2 Intel 3014 CPUs, main frequency 1.7G, a total of 12 cores, 64G memory. The database table data and SPL group table file are stored on the same SSD hard disk.
The amount of data is deliberately made larger than memory, so as to ensure that the operating system can not cache all of these data into memory, and it will read the hard disk in the actual calculation.
2. Oracle Testing
The test is divided into three situations: single grouping with single calculation amount, single grouping with double calculation amount, and double grouping with double calculation amount.
a) Single grouping with single calculation amount
select /*+ parallel(12) */ mod(id,100) Aid,max(amount) Amax from topn group by mod(id,100)
b) Single grouping with double calculation amount
select /*+ parallel(12) */
mod(id,100)+floor(id/20000000) Aid, max(amount) Amax, min(amount) Amin from topn group by mod(id,100)+floor(id/20000000);
The calculation formula is doubled, that's twice the amount of calculation.
c) Double grouping with double calculation amount
select /*+ parallel(12) */ * from
( select mod(id,100) Aid,max(amount) Amax from topn group by mod(id,100) ) A
join
( select floor(id/20000000) Bid,min(amount) Bmin from topn group by floor(id/20000000) ) B
on A.Aid=B.Bid;
The amount of computation here is roughly the same as b), but there are two groups. We will see if the database will be traversed twice. The last JOIN operation involves only 100 rows of data, and the time is negligible.
3. SPL Testing
Let's run Oracle's test again with SPL.
a) Single grouping with single calculation amount
Write SPL script to execute test:
A 

1 
=now() 
2 
=12 
3 
=file("/home/topn/topn.ctx").create().cursor@m(id,amount;;A2) 
4 
=A3.groups@u(id%100:Aid;max(amount):Amax) 
5 
=interval@s(A1,now()) 
b) Single grouping with double calculation amount
Write SPL script to execute test:
A 

1 
=now() 
2 
=12 
3 
=file("/home/topn/topn.ctx").create().cursor@m(id,amount;;A2) 
4 
=A3.groups@u(id%100+id\20000000:Aid;max(amount):Amax,min(amount):Amin) 
5 
=interval@s(A1,now()) 
c) Double grouping with double calculation amount
Write SPL script to execute test:
A 
B 

1 
=now() 

2 
=12 

3 
=file("/home/topn/topn.ctx").create().cursor@m(id,amount;;A2) 

4 
cursor A3 
=A4.groups@u(id%100:Aid;max(amount):Amax) 
5 
cursor 
=A5.groups@u(id\20000000:Bid;max(amount):Bmax) 
6 
=A4.join@i(Aid,A5:Bid,Bid,Bmax) 

7 
=interval@s(A1,now()) 
The SPLspecific traversal reuse grammar is used here. A cursor is defined in A3, and two sets of calculations for this cursor are defined in A4/B4 and A5/B5, which means that the two results will be calculated simultaneously in one cursor traversal process.
4. Analysis and conclusions
The test time for the three cases is as follows:
Test results (time unit: seconds)
single grouping with single calculation amount 
single grouping with double calculation amount 
double grouping with double calculation amount 

Oracle 
458 
692 
878 
SPL 
336 
350 
376 
From Oracle's test result, double grouping with double calculation amount is nearly 200 seconds slower than single grouping with double calculation amount. This is not a negligible time, because the computing amount of the two is almost the same. The extra time is estimated to be one more traversal time. This means that the database will not automatically optimize traversal reuse. In case of double grouping the data table will be traversed twice. As a result, it will take almost twice as long to do one more grouping.
SPL adopts the mechanism of traversal reuse. The computing time of the three tests is very little in difference. It will not traverse more than once to do more grouping, and some logic of reuse control will not slow down much.
Please note, Oracle's amount field type is set to decimal when preparing data, so the calculation speed is slower; SPL group table uses double type, so it is much faster. But this test does not compare Oracle and SPL computing performance, these differences do not affect the above conclusions.