Performance Optimization - 4.6 [Traversal technology] Grouping and aggregating
Grouping is a common traversal-style operation: it must read and process every record that participates in the grouping. For an operation like this that requires the whole set, an index makes little sense (it is useful only in very few scenarios, which will be discussed in the next chapter). Some programmers who do not understand the principle of grouping add an index to the database table when grouping is slow, which only increases the load on the database.
The grouping process is usually divided into the following steps: i) generate an empty grouped result set; ii) traverse the original data and calculate the grouping key value of each record; iii) search the grouped result set for the grouped subset corresponding to that key value and add the record to it; if no such subset is found, create a new grouped subset containing this record.
During this process, the number of times each action occurs (reading a record from the original data, calculating its grouping key value, adding it to a grouped subset) is fixed and equal to the number of original records, so it cannot be reduced. The only action that can be optimized is searching the grouped result set for the target grouped subset, which is a standard search operation. In the absence of special conditions, this search generally uses the hash method: position the grouped subsets by the hash values of their grouping key values (equivalent to sequence-number positioning), and when a new record arrives, compute its grouping key value and hash value, so that its own group can be found quickly among the small number of grouped subsets sharing the same hash value. We call this grouping algorithm hash grouping.
The total size of all grouped subsets is as large as the original data set. If the amount of data is so large that memory cannot hold it, then memory cannot hold the grouped subsets either, so the hash method alone is only suitable for in-memory data sets. However, in most cases grouping is accompanied by aggregation. We do not need to retain the grouped subsets; we only need the aggregate value of each subset, and these aggregate values can usually be computed cumulatively (summing, counting, taking max/min, and so on). We can therefore discard the grouped subsets and keep only the grouping key values and their corresponding aggregate values (the result is then a table sequence rather than a set of sets). The result set becomes much smaller, and even when the original data is large, the grouped result may still fit in memory. This process still needs to find the target record for accumulation, and it still uses the hash scheme, so this kind of grouping and aggregating is also called hash grouping.
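The accumulation described above can be sketched in a few lines of Python (an illustration of the algorithm, not SPL's implementation). A Python dict is itself a hash table, so looking up the accumulator for a grouping key is exactly the hash search described above; only the key and its running sum are kept, and the grouped subsets themselves are never materialized:

```python
# Illustrative sketch of in-memory hash grouping with aggregation.
def hash_group_sum(records):
    """records: iterable of (key, value) pairs; returns {key: running sum}."""
    result = {}
    for key, value in records:
        # the dict's hash lookup finds this key's accumulator (or creates it)
        result[key] = result.get(key, 0) + value
    return result

orders = [("East", 10), ("West", 5), ("East", 7), ("North", 3), ("West", 1)]
print(hash_group_sum(orders))  # {'East': 17, 'West': 6, 'North': 3}
```

Each record is read, hashed, and accumulated exactly once, matching the fixed action counts discussed above.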
Sometimes, even if only the aggregate values are needed, the grouped result set may still be too large to fit in memory, which means the number of grouping key values is very large. This situation is called big grouping. Correspondingly, the situation where the grouped result is small enough to be held in memory is called small grouping.
When dealing with big grouping, the hash grouping algorithm must be extended to external storage. To do this, we enlarge the range of hash values so that the grouping key values are scattered over different hash values as much as possible, i.e., not many grouping key values correspond to the same hash value. Since the range of the hash function is known in advance, we can divide it into several intervals, each small enough for its keys to be stored in memory (a simple equal division is fine). During traversal, each time a batch of records is read, calculate the hash values of their grouping key values, write each record to the buffer file of the interval its hash value falls in, then free the memory and read the next batch, until the traversal ends. After that, read each buffer file separately and do hash grouping again. Since the hash values of the data in each buffer file all fall within one interval, each file's groups can definitely be held in memory, so the files can be processed one by one without running out of memory.
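The two-phase process can be sketched as follows (a hypothetical illustration in Python, not SPL internals; the file layout and bucket count are made up for the example). Phase one scatters records into buffer files by hash interval; phase two hash-groups each buffer file independently, and since a given key always lands in the same bucket, the partial results never overlap:

```python
# Illustrative sketch of external hash grouping over buffer files.
import os
import tempfile

def external_hash_group_sum(records, n_buckets=4):
    """Group-and-sum (key, value) records too numerous to group in one pass."""
    with tempfile.TemporaryDirectory() as tmp:
        # Phase 1: scatter records into buffer files by hash interval.
        paths = [os.path.join(tmp, f"buf{i}.txt") for i in range(n_buckets)]
        files = [open(p, "w") for p in paths]
        for key, value in records:
            files[hash(key) % n_buckets].write(f"{key},{value}\n")
        for f in files:
            f.close()
        # Phase 2: hash-group each buffer file separately; each one is small
        # enough to be processed entirely in memory.
        result = {}
        for p in paths:
            partial = {}
            with open(p) as f:
                for line in f:
                    key, value = line.rstrip("\n").split(",")
                    partial[key] = partial.get(key, 0) + int(value)
            result.update(partial)  # keys never cross buckets, so no conflicts
        return result
```

In a real system each batch would be flushed as it fills rather than line by line, but the structure of the two phases is the same.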
SPL provides two functions, one for small grouping aggregation and one for big grouping aggregation. Small grouping directly returns a result set, while big grouping returns a cursor. This cursor is based on the buffer files described above, and the second round of grouping and aggregating is performed while data is fetched.
| A | |
|---|---|
| 1 | =file("orders.btx").cursor@b(area,amount) |
| 2 | =A1.groups(area;sum(amount):amount) |
| 3 | =A1.groupx@u(area;sum(amount):amount).fetch() |
The parameter rules of the two functions are the same. The groups() function for small grouping always uses the hash algorithm, while the groupx() function for big grouping uses the hash algorithm only when the @u option is added.
Small grouping does not need to generate buffer files, while big grouping always does. If the big-grouping function is used for a small grouping, a lot of time is wasted writing buffer files even though the final result is small. It is therefore important to predict the size of the grouped result set in advance and select the appropriate function, which is why SPL provides two grouping functions. You can compare the calculation times of A2 and A3 to see the difference.
We also find that the grouped result set returned by groupx@u() appears disordered: its order is related neither to the grouping key values nor to the order of the original data. This is exactly a characteristic of hash grouping; the result looks disordered because it is actually ordered by the hash values of the grouping key values. By contrast, SPL sorts the result by grouping key value before the small-grouping groups() function returns, so it appears ordered by key.
There is another method to implement big grouping, and it can also implement big sorting.
The specific steps are as follows: 1) traverse the records in turn and perform the hash grouping described above (here the range of hash values should not be too large); 2) constantly monitor the number of grouping key values in the in-memory grouped result set; 3) once the number reaches a threshold, sort the current grouped result set by grouping key value, write it to a buffer file, and clear the result set to free memory; 4) continue traversing the remaining records, repeating this process until the traversal completes, leaving a batch of buffer files. Since the data in these files is already ordered by grouping key value, we only need to perform an ordered merge followed by ordered grouping and aggregating, which requires only a small amount of memory. This grouping algorithm is called sort grouping.
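The steps above can be sketched in Python (an illustration of the algorithm, not SPL internals; the spill threshold and in-memory "runs" standing in for buffer files are simplifications for the example). Partial sums accumulate in memory; whenever the number of keys reaches the threshold, the partial result is sorted by key and spilled as a run; finally the sorted runs are merged, summing adjacent equal keys:

```python
# Illustrative sketch of sort grouping with spilled sorted runs.
import heapq

def sort_group_sum(records, threshold=1000):
    """Group-and-sum (key, value) records using the sort grouping scheme."""
    runs, current = [], {}
    for key, value in records:
        current[key] = current.get(key, 0) + value
        if len(current) >= threshold:
            # spill: sort the partial result by key and write it out as a run
            runs.append(sorted(current.items()))
            current = {}
    if current:
        runs.append(sorted(current.items()))
    # Ordered merge of the sorted runs plus ordered grouping: equal keys are
    # adjacent in the merged stream, so they can be accumulated on the fly.
    merged = []
    for key, value in heapq.merge(*runs):
        if merged and merged[-1][0] == key:
            merged[-1][1] += value
        else:
            merged.append([key, value])
    return {k: v for k, v in merged}
```

Note that the final result comes out ordered by grouping key value, and when the data is small the spill branch never fires, which are exactly the two advantages discussed below.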
Similarly, the result of such a big grouping operation is also a cursor based on buffer files, and the second round of merging, grouping and aggregating is performed while the cursor fetches data.
| A | |
|---|---|
| 1 | =file("orders.btx").cursor@b(area,amount) |
| 2 | =A1.groupx(area;sum(amount):amount).fetch() |
The groupx() function without options works according to this algorithm.
Compared with hash grouping, sort grouping has some advantages. The grouped result set of sort grouping is directly ordered by grouping key value, which may be useful in the next round of calculation. More conveniently, it allows the big-grouping algorithm (and function) to be used for small grouping: a careful look at the algorithm above shows that if the actual result set is small, writing buffer files is never triggered, because the in-memory grouped result set never grows large enough to be buffered out.
Another advantage of sort grouping is that it is more stable. With an unlucky hash function, an in-memory search merely wastes time; but in big grouping we may encounter a hash value whose corresponding grouping key values are too many to be held in memory, and a second round of hashing then becomes necessary, which is cumbersome and inefficient.
Therefore, the default big grouping provided by SPL is sort grouping (without options).
However, if we know for sure that the grouped result set is large (that is, buffer files will definitely be written) and we are lucky with the hash function, then hash grouping is more efficient than sort grouping. The reason is that sorting itself is relatively slow, and the second-round merge has to read multiple buffer files at the same time, causing more concurrent reads on the hard disk. The second round of hash grouping, by contrast, reads only one buffer file at a time and causes no concurrent reads. This is why SPL also provides hash big grouping.
Databases usually use an optimized hash grouping method: first try a hash function with a small range, and if too many grouping key values are found, hash again and start buffering. This avoids always writing buffer data and performs well for small grouping, but the algorithm is much more complicated, and performance drops significantly for big grouping.
However, even though sort grouping adapts to both small and big grouping, groupx() is still more complex and in practice a little slower than groups(). More importantly, big grouping does not parallelize well. If multiple threads accumulate into the same intermediate result set at the same time, they often end up waiting on each other for write access. If instead each thread keeps its own intermediate result set, the memory is split (each thread can use only a fraction of it), so buffer files may have to be written even when the whole memory could have held the grouped result set; disk reads and writes then become very slow, easily offsetting the benefits of parallel CPU operation. Even when writing buffer files is unavoidable, multiple threads writing at the same time cause concurrent writes on the hard disk, which often severely impacts performance. Therefore, the groupx() function does not necessarily achieve better performance over a multi-cursor.
Therefore, if you know for sure that the result set is small, use groups() for the best performance. If you can predict the size of the result set, you can also choose an appropriate number of parallel threads. If you are unsure about the size of the result set, groupx() is the more reliable choice, and its performance loss in a single thread is not significant.
Once we understand sort grouping, big data sorting becomes relatively simple: read a batch of data, sort it and write it to a buffer file, repeat until the traversal ends, and finally merge the sorted buffer files. The result set of a sort is the same size as the original data set and does not shrink the way a grouped result does, so big sorting always generates buffer files.
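This external sorting scheme can be sketched in Python (an illustration only; real buffer files are replaced by in-memory sorted runs, and the batch size is arbitrary). Each full batch becomes a sorted run, and heapq.merge performs the final ordered merge of all runs with little memory:

```python
# Illustrative sketch of big (external) sorting via sorted runs and a merge.
import heapq

def external_sort(records, batch_size=1000):
    """Sort a stream of records batch by batch, then merge the sorted runs."""
    runs, batch = [], []
    for rec in records:
        batch.append(rec)
        if len(batch) >= batch_size:
            runs.append(sorted(batch))  # in a real system: write a buffer file
            batch = []
    if batch:
        runs.append(sorted(batch))
    # Unlike grouping, the merged output is as large as the input, so in a
    # real system it is consumed as a stream (a cursor) rather than collected.
    return list(heapq.merge(*runs))
```

The difference from sort grouping is only that nothing is aggregated, so the runs, and the final result, stay as large as the input.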
Similarly, it is not easy to obtain a linear performance improvement by parallelizing big sorting. Although in-memory sorting can be faster, concurrent writes to the hard disk by multiple threads may offset the advantage.
SPL does not directly provide a hash-grouping-style big sorting algorithm; you can work one out yourself after learning the program cursor technology and the sequence-number segmentation mechanism described in the next chapter. Big sorting is usually needed only in the data preparation stage; in most cases a merge algorithm suffices, and there are not many cases where the original big data must be sorted.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL