Performance Optimization - 5.1 [Traversal technology] Ordered grouping and aggregating

 

If the data table is ordered by grouping keys, we can implement the ordered grouping algorithm.

The process of ordered grouping is very simple, you only need to compare the key value of the current record and the last grouped subset when traversing. If two values are the same, continue to put this record into this grouped subset. On the contrary, if two values are different, it indicates that this grouped subset has been calculated completely, in this case, re-create a new grouped subset, and then put this record into the new subset. Repeat these steps until the end of the traversal, and you can get all grouped subsets.

Similarly, for the big data, the more common way is to obtain the aggregation value of each grouped subset. In this case, only one grouped result set needs to be kept. When traversing at a record, judge the grouping key value so as to decide whether to accumulate the current record value to the last grouped record or generate a new grouped record. This algorithm applies to all accumulable aggregation operations (describable by iteration function).

The ordered grouping only needs to compare adjacent records, so its complexity is very low, and the luck problem in hash grouping does not exist in this algorithm. The records in grouped subsets or grouped result set are calculated one by one, thus it will not change the calculated grouped subsets and grouped records when a new data is traversed. Such computing process does not need a large memory space to store calculation results, which is well suited for returning the results in the form of a cursor, therefore, it is naturally suitable for big grouping.

In fact, we introduced the sort grouping of big grouping earlier, its second round uses exactly this algorithm.

For instance, for the orders ordered by time, the total amount of orders per day can be calculated by ordered grouping algorithm:

A
1 =file(“orders.btx”).cursor@b()
2 =A1.groups@o(dt;sum(amount))

In SPL, the groups@o()represents the ordered grouping, yet here still returns a table sequence, using groups() is considered as small grouping.

The groups@o()for small grouping can work on multi-cursor. The data table ordered by grouping key values can be deemed that it is composed of successive grouped subsets. Segmentation does not necessarily happen between two grouped subsets (possibly within a certain grouped subset), but groups@o() will finally adjust the calculation result.

SPL does not provide a corresponding groupx@o(), the ordered grouping algorithm to return a big result uses another function:

A
1 =file(“orders.btx”).cursor@b()
2 =A1.group(dt).new(dt,~.sum(amount))
3 =A1.group(dt;sum(amount))

By default, the group() function on the cursor will consider that the cursor is ordered by grouping key values, it will return a cursor, and each grouped subset can be taken out in turn, after that, further operations can be performed. If the aggregation expression is written in the parameters, the cumulative method will be used directly to perform the aggregation, in this case, a cursor will still be returned, and however, the cursor members will be the grouped records consisting of grouped aggregation values.

Here, A2 and A3 perform the operations of the same result in different ways. In the way of A2, each grouped subset needs to be taken out, thus it requires that the grouped subset cannot be too large; while in the way of A3, the calculation process of iteration function is used to perform the aggregation calculation, which can calculate normally when it encounters a big grouped subset. If you only need to get the grouped aggregation value, it is more advantageous to use A3 code.

Both A2 and A3 are cursors, and fetch()is still needed to take out calculation data. Moreover, the group() function is a delayed cursor, actual calculation will be performed only while data-fetching.

For big grouping, group()cannot directly support the multi-cursor. If the segmentation position falls within a certain grouped subset, an incorrect calculation result will occur. Theoretically, we can adopt the method of discarding the head and complementing the tail discussed in the text file segmentation section, which is described as follows: starting from the segment position (except the first segment), find the next record with a different grouping key value, then start traversing this segment. After this segment is traversed, continue to the next segment and fetch a record and continue to traverse until this grouped subset is completely traversed. In this way, a correct calculation result can be ensured in case the multi-cursor is used for segmentation.

Unlike a text file, however, a certain grouped subset may be very large, which may cause a very large “head” to be discarded in the above-mentioned method, resulting in a decrease in performance due to repeated read. Moreover, since it is not clear how the cursor is calculated out (cursor may come from various sources) during the execution of group() function, the action of the said method should be well handled when the data table is segmented, it requires you to know which field the data table is sorted by when segmenting.

SPL’s composite table provides another segmentation mechanism. Because SPL needs to specify the data structure in advance when creating a composite table, therefore, some information like ordered fields is known in advance, it can ensure that the segmentation point will definitely fall between the grouped subsets when segmenting. The text files and bin files do not need to specify the data table structure in advance, this mechanism is not provided.

A
1 =file(“orders.ctx”).create@p(#dt,…)
2 =file(“orders.ctx”).open().cursor@m(dt,amount;;4)
3 =A2.group(dt;sum(amount))

While creating a composite table, @p can be used to inform SPL not to put records with the same value in the first field (usually an ordered field) into two segments while segmenting. Currently, the segmentation processing is only provided for the first field, as the past practice has shown that it is enough.

Then, you can use multi-cursor to calculate. Note that the syntax of multi-cursor for composite table are a little different from those for text and bin files, which need to be written at the parameters after the second semicolon.