Performance Optimization - 4.7 [Traversal technology] Understandings about aggregation
Let’s consider this question: how to find the top 10 out of 100 million order amounts.
The simple idea is to sort the 100 million records from large to small by amount, and take the amount field of the first 10 records, and then discard the rest.
Writing SQL in the database to solve this problem uses exactly this idea.
However, sorting itself is a very slow action, moreover, and big sorting also involves data-buffering, it will not only lead to significant decrease in performance, but it is difficult to perform the parallel computing.
Actually, it is easy for us to think of a simpler algorithm.
As long as we keep a result set with 10 members, first, fill with 0 (any small number will do), and then, traverse the order table. If the current order amount is bigger than the smallest one in the result set, replace the smallest one with this amount. The result set, after traversal, is the number we want.
This algorithm only needs to traverse the original data table once, and there is no need to sort (actually, the traversal times of sort is log2N), let alone buffer the files. Moreover, the parallel computing in segment is also very easy, it is nothing more than calculating the top 10 of each segment, and then calculating the top 10 of the union of the said top 10. This algorithm still does not involve a concurrent write to the hard disk.
If you want to take the top M out of N members, the complexity of sort is N*logN, while the complexity of the above algorithm is N*logM. For this example, even for an all-in-memory operation, the CPU computation can decrease by around 8 times.
This idea is not uncommon. If the question is changed to calculate the maximum value, then almost everyone will think of using this method. But when the question is changed to take the top M, we will be more used to thinking of sort method first.
This requires us to expand our understanding on aggregation operation.
Usually, the aggregation operation we understand is to calculate a set to a single value, such as summing, counting, computing maximum / minimum value. However, if we expand our idea to regard the case where the return value is a small set as aggregation operation, then we can use aggregation operation to solve relevant issues.
Let’s look at this example again, we can regard “taking the top N” as an aggregation operation which is the same as summation and counting operations, except that it returns a set rather than a single value.
|2||=A1.total(sum(amount), top(-10,amount), top(-10;amount) )|
The top()function in SPL, the same as sum(), is considered as an aggregation function, and only has a direction from small to large. The -10 in the parameter means to take the last 10 which are the 10 with maximum amounts. The top(-10,amount) will return 10 maximum amount values, while top(-10;amount) will return 10 records that maximize the amount.
Another advantage of regarding top()as an aggregation function is that it can be used in grouping and aggregating where it is still the same as functions such as sum(), count(), max() and min():
In this way, the top 10 order amounts and corresponding orders in each region can be calculated, moreover, the parallel computing of multi-cursor can be used. Attention should be given that the values of the last two fields in A2 calculation results are sequence (set).
If top() is not regarded as an aggregate function, it will be very difficult to do this operation in grouping and aggregating. In SQL, you need to use window functions to barely describe this kind of operation, and the operation performance is very poor.
Aggregation operation is, in essence, an operation for a set, however, when we actually perform this operation, it is often unnecessary to get all members of set ready. Many aggregation operations can be performed by using the cumulative method to gradually traverse the members in a set, in this way, the operation for big data can be performed.
Operations like summing, counting, computing maximum / minimum value as well as “taking top M” just mentioned, all meet this characteristic.
We call this type of aggregate functions iteration function, and the following characteristics can be abstracted from its operation process:
1) An initial value is given as the calculation result;
2) Each time a new set member is encountered, perform a computing on this member and last calculation result to get a new calculation result;
3) After traversal, the calculation result can be returned.
To calculate iteration function, you only need to keep a current result value, and the traversed set members can be discarded. Even for the aggregation of grouped subsets in grouping operation, only one current result value needs to be kept for each group, and occupied memory is small. To perform the calculation of iteration function, it only needs to traverse the original data table once, and there is no need to buffer the files, and the parallel computing can also be performed.
SPL designs a general form for iteration functions:
Where, a is an initial value; x is an expression, in which ~~ represents last calculation result, and ~ represents the current set member. The calculated x is used as the new calculation result, i.e. the ~~ of the next calculation. After traversal, this function will return current ~~.
We can use this form to define common aggregation operations such as sum()and count():
For top(), although it is more troublesome, it can also be defined. You can take it as an exercise.
We can now expand aggregation operation to more general cases. As long as these cases can be described by iterate(), and the calculation results occupy just a little memory, the cumulative method can be used to achieve higher computing performance. This method can be used in grouping where original data table only needs to be traversed once and buffering the files is not needed (buffering is still needed in big grouping, but it comes from big grouping itself, and is not caused by aggregation calculation). However, there are some differences in parallel computing of iteration functions. Iteration function itself can perform parallel computing for segmented data tables to obtain multiple calculation results (one for each segment), but the manual coding is needed to perform the second round of aggregating (re-aggregate the calculation results of each segment to one result), and the parallel computing cannot be performed directly based on multi-cursor (fork syntax of multi-cursor can be used, but it still needs to do the second round of aggregating operation manually).
We can also use iterate() to perform some aggregation operations that are not defined in advance, such as continuous multiplying, getting the union, etc
Here, getting the union is exactly an undefined aggregation operation.
SPL stipulates that, in the iteration function for calculating the records, the field name can be directly used to represent the field of current record, and there is no need to write it as ~.area. A2 will calculate the region to which all orders are sold, and A3 will calculate the region to which each product is sold.
In structured data operations, the common simple aggregation operations are the above-mentioned several operations that have been defined, or operations that can be derived from such operations. For example, the continuous multiplying can be achieved by taking logarithm, summing and then using exponent. For “getting the union” in the above example, it can also be replaced by DISTINCT operation (id() function in SPL). Custom aggregation operations that need to be written with iteration functions are not uncommon, but such operations may involve more complex business backgrounds, which are not easily illustrated with simple examples.
Assuming that the order table is ordered by time, and if you want to calculate the total tax of each product with an initial tax rate of 5% and a tax rate reduced to 3% for subsequent orders after the cumulative tax amount exceeds 10,000, then this aggregation can hardly be described by conventional functions but iteration function, in this way, it can also be calculated at a higher performance.
The following iteration function will calculate the maximum number of consecutive orders with an amount exceeding 50 in each region.
In the next chapter, we will discuss the form of iteration function applied to ordered detail data , and will give more meaningful examples.