Performance Optimization - 5.2 [Traversal technology] Ordered grouped subsets
When the data table is ordered by grouping key, the grouped subsets can be read out in turn in the form of cursor, which allows us to do some complex operations.
Let’s take the one-year account transaction table as an example. We want to count the number of accounts with the consumption times of more than m within n consecutive days, where n and m are the parameters entered by users at program interface, and hope to find the result immediately.
It is a relatively complex operation which is unlikely to be written in a simple aggregation function (nor with an iteration function). Generally, the calculation will be easy when these transaction records are read into the memory, that is, you need to take out the transaction records under one account at a time to calculate. Since the transaction data under one account is generally very small, the memory is sufficient to hold them.
The subset grouped by account is exactly the transaction records under one account, yet the number of accounts in this situation may be very large, if this is the case, it is a typical big grouping, and it is impossible to store all grouped subsets in the memory. When the transaction records of an account are fetched every time, if there is no index, the whole table needs to be traversed, which is completely unacceptable; Even if there is an index, too-many fetching times may cause a slow computing speed because the original data table is usually sorted by transaction time (refer to the explanation in section 3.7: Search that returns a set).
If we sort the data table by account in advance (sort the transactions in an account by date), and then use the ordered grouping technology, we can easily take out these grouped subsets to perform the calculation:
|2||for A1;id||if (k=A2.(dt).len()-m)<=0||next|
For the ordered cursor, the for statement will fetch a grouped subset at a time, then judge whether there are m transactions within n days.
During actual operation, the code will be further optimized to read more accounts each time.
These lines of code can also be concatenated into the group function:
|2||=A1.group(id).((k=~.(dt).len()-m)>0 && k.pselect(~(#+m)-~(#)<=n)|
For each grouped subset taken out by group(), a logical expression will be calculated. If the result is true, it indicates that there are m transactions in n days. Furthermore, A2 will also return a cursor, we only need to traverse the cursor and count the number of true.
If the data table uses the composite table segmentation method described in the previous section, this operation can also work based on multi-cursor, in this case, we only need to add options at the function that generates the cursor:
|2||=A1.group(id).((k=~.(dt).len()-m)>0 && k.pselect(~(#+m)-~(#)<=n)|
The method to maintain and append the ordered data has been discussed in the previous chapters. As long as we pre-sort the data, and implement the technique of converting the date to an integer as mentioned earlier, it is also possible to obtain a high performance with immediate response even if the amount of data is very large. This calculation method can be one or two orders of magnitude faster than the cursor calculation method on traditional database (you cannot write this kind of logic in a single SQL statement).
The ordered grouped subset technology is very useful for improving the performance of complex analysis on massive accounts.
With grouped subsets, it is also very easy to achieve DISTINCT, and we only need to take one record in each group. For example, let’s calculate in which months each account in the transaction table was traded.
From the cursor calculated out in A2, we can know the account and months in which the transaction occurred, actually, only the first record of each grouped subset is taken. Note that this is just a cursor, we also need to use the fetch() to actually calculate and fetch the data. There will be many examples like this below, the only thing we need to do is to get the cursor because the result set may be too large to be taken out completely. After getting the cursor, we can do further calculation or save the result set as a file.
The operation for taking the grouped subset is relatively common, SPL provides the options:
The group@1()will do the same thing as above, but will not firstly generate the grouped subset. In this way, the memory consumption will be less, and it can also suit to the situation where the grouped subset is sometimes large (but it is usually a small grouping under this situation).
In fact, group@1()can be understood as DISTINCT whose effect is basically the same with that of id() function. When the data is in order, DISTINCT can be performed efficiently, while when the data is out of order, DISTINCT is as complex as grouping.
Let’s take the above account transaction table again as an example, we now want to add a monthly cumulative amount information for each record, i.e., the cumulative transaction amount of the account in a month after this transaction is completed, and then filter out the transaction (including date) when the cumulative transaction amount of each account exceeds 100 for the first time every month.
This information can be calculated after reading the grouped subset:
This kind of cumulative calculation can be performed using the iteration function for detailed data. The iteration function here still has the aforementioned characteristics: there is an initial calculation result, and the traversed members are used to calculate new result each time. Unlike the aggregation function that only returns the final result, the iteration function will return the current calculation result every time when detailed data is involved. Consequently, the monthly cumulative amount as of each transaction can be calculated out in the field added for the derive function in A2.
It should be noted that each piece of data, fetched by the cursor that is calculated out in A2, is a table sequence, and this table sequence needs to be filtered in A3 (take out the first record whose cumulative amount reaches the requirement), instead of filtering the cursor directly.
However, this algorithm needs to take out the grouped subset. If we change to another situation where the order table is sorted by product and date, and the requirement is changed to calculate out the date when the monthly cumulative order amount of each product exceeds 100, then the method of taking out the grouped subset in advance followed by calculating will not work because there may be many transaction records of one product and it may be a large grouped subset.
For the cumulative calculation on the ordered cursor, we can also use the grouping parameter of iteration function to achieve:
|2||=A1.derive(iterate( ~~+amount,0; prudoct,month(dt) ): mca)|
The parameter after the semicolon in the iterate()parameter is used to represent the grouped fields. When these fields (or expressions) change, SPL will restart the calculation of a new round of iteration function (re-set the calculation result as initial value and continue to iterate). During the iteration calculation, we only need to compare the previous record, without the need to firstly take out the whole grouped subset, in this way, it can either make the memory footprint less or support the large grouped subset. Moreover, since the cumulated amount field will be added to original record, the select() can be directly performed to cursor in A3.
Finally, we need to use group@1() to perform DISTINCT. What is taken out at this time is the first record of grouped subset. Although the grouping key is product and month, the taken-out record is the record before grouping, i.e., the record containing the dt field.