Performance Optimization - 5.5 [Traversal technology] Second-half ordered grouping

 

Performance Optimization - 5.4 [Traversal technology] First-half ordered grouping

We have handled the “first-half ordered” situation, then is there something we can take advantage of in the “second-half ordered” situation?

Still, we take the account transaction table as an example, where the data under each account is sorted by date, now we want to count the total transaction amount under all dates. Since the whole table is not ordered by date, the ordered grouping algorithm cannot be used. Yet, because the number of dates in a few years is not large, it still belongs to a small grouping, which can be easily written with groups():

A
1 =file(“trades.btx”).cursor@b(dt,amount)
2 =A1.groups(dt;sum(amount))

However, this code does not take advantage of the characteristic that the data in the account is ordered by date, and remains a conventional hash grouping mechanism.

Let’s take this example to explain how to take advantage of “second-half ordered” situation.

Assuming that we use the ordered grouping mechanism to implement the “second-half ordered” situation. In this way, when traversing the records, we only need to compare the grouping key with the current last grouped record to determine whether to generate a new group or accumulate this record. As long as the dt is in order in the same account, errors will not occur, and the calculation and comparison of hash values can also decrease. When traversing to the next account, the dt will start resorting, and the dt of new record may be smaller than that of the current last grouped record. Once this phenomenon is found, we need to compare this dt with the first record of the current grouped records to find an appropriate position where we can either accumulate this record or insert a record, and then traverse the next transaction record. There is no need to compare once again the record that has been compared in grouped record table because it is also in order, and it will be compared with the grouped record table from the head again when the next account is traversed.

As a result, when the accounts are traversed one by one, this grouped record table will be compared again and again from the head to the tail. Every time a record is traversed, the ordered grouping algorithm only compares it once, but many times of comparison may be needed before this algorithm can find an appropriate position. However, as long as the grouped record is compared, it will not be compared again until the next account is traversed.

As a whole, the comparison times will certainly be more than that of ordered grouping, but if most of the data are in order, the occurrence proportion of comparison from the head (when traversing to the next account) will be relatively less (within the same account), and the increased comparison times will not be too many, which may be more advantageous than hash grouping.

Here comes a key problem that it may involve the insertion to grouped record table. The conventional sequential insertion is a very complex action. To maintain order, it needs to move all the following members backward to make room, and the complexity of this operation will offset all the advantages of taking advantages of order. Thus, we need to use the linked list to maintain the grouped record table, in this way, only the property of member before and after the insertion point is changed when inserting, and a large number of members will not be moved. We can’t do binary search after using the linked list, yet the grouped record table is originally compared in order, therefore it has little impact here.

SPL also makes this mechanism an option, which can be used directly:

A
1 =file(“trades.btx”).cursor@b(dt,amount)
2 =A1.groups@h(dt;sum(amount))

Executing the groups@h() will not use the conventional hash method but the linked list mechanism mentioned above.

Big grouping can also perform this kind of second-half ordered grouping with a similar algorithm. The difference is that when the grouped record table in memory is large enough to a certain extent, it will be written out to the buffer file, and then the process will start again, finally, these buffer files will be merged. Generally, the process is similar to that of sorting big grouping, and it is also self-adaptively to big grouping and small grouping.

A
1 =file(“trades0.btx”).cursor@b()
2 =A1.groupx@h(id;sum(amount))

Suppose that the trade0 is sorted by date, and the transactions under the same date are sorted by account, this code can calculate the total transaction amount of each account at higher performance.

However, the main cost of big grouping depends on the writing and reading of buffer files, this algorithm will still have advantages, but not as significant as small grouping.

Theoretically, this algorithm works for any data, and does not necessarily require “second-half ordered” situation because any data can actually be regarded as second-half ordered data. However, if the ordered degree is too low, leading to frequent comparison with grouped record table from the head, the performance will not be comparable to hash grouping, therefore, attention should be given to the ordered degree of data when using this algorithm.

The higher ordered degree in the second-half, the more obvious the advantages of this algorithm will be. When the ordered degree reaches the extreme, it becomes an ordered grouping.

Unlike the method for first-half ordered situation, which can be used for sorting, this algorithm has little significance on sorting. The quick sort algorithm commonly used by small sorting has already taken advantage of the ordered degree of original data, while the main cost of big sorting depends on the writing and reading of buffer files, this algorithm does not reduce the number of buffer files, so it has no obvious advantages.

Since the grouping operation may involve aggregation, the actual buffer file is usually smaller than original data, sometimes much smaller, as a result, the time cost proportion from comparison in memory will become higher, and the advantages of this algorithm will appear. However, if each grouping key is almost unique in the whole table, and actual aggregation rarely occurs, and there is little difference between buffer file and original table, this algorithm will not have much advantage over hash grouping or sort grouping.


Performance Optimization - 5.6 [Traversal technology] Serial number grouping and controllable segmenting
Performance Optimization - Preface