Performance Optimization - 5.6 [Traversal technology] Serial number grouping and controllable segmenting

 

SPL provides an in-memory serial number grouping and aggregating function groups@n, and a same function for external storage cursor.

If the grouping key value can be computed to the serial number in a simply way, then there is no need to calculate and compare the hash values when grouping. Instead, we can directly use serial number to find the group (to do aggregating and accumulating), and the performance will be better.

Note that the computation must be a simple process. If it is too complex (for example, the serial number can only be obtained after searching a table), it cannot catch up with computing the hash value.

A
1 =file(“trades.ctx”).open().cursor(dt,amonut)
2 =A1.groups@n(month(dt);sum(amount))

In this way, the transaction amount can be grouped and aggregated efficiently by month. This code is suitable for month or year that can be easily computed to serial number. We can use the data type conversion scheme discussed in Chapter 2 to quickly compute the serial number of month and year from the integer date.

The groups@n() also supports multi-cursor.

The groups() will return small groups, while for big grouping, can we also use serial number to improve performance?

It certainly can in theory. Actually, this is a simplified hash method for big grouping, and the grouping key itself is the hash value, thus the time of computation and comparison is omitted, and the returned cursor will naturally be ordered by grouping key.

A
1 =file(“trades2.btx”).cursor(id,amount)
2 =A1.groupx@n(id;sum(amount))

Suppose that the account id is continuous natural number, in this way, the transaction amount of each account can be calculated.

However, unlike the serial number grouping for small grouping that has obvious performance advantages over hash grouping, the serial number grouping for big grouping is not significantly faster than the hash method for big grouping. Since the time of big grouping is mainly consumed at writing and reading the buffer files, the time difference between using serial number and hash value to locate the grouped records in memory can be neglected relative to the reading and writing time of buffer file.

Compared with conventional hash grouping, the significance of groupx@n() is that it does not need the second round of hashing caused by “unlucky hash function” that may occur in hash grouping.

In the process of hash grouping, knowing the range of hash values (which can be determined by hash function) can control the number of segments in a relatively reasonably way (let’s review the hash method for big grouping, that is, segment the hash values first, and then write the corresponding groups into different buffer files). For the serial number grouping, however, it is not known what the maximum serial number is, and the system may be conservative when it estimates the serial number by itself, resulting in too many segments, and affecting performance as well. Therefore, group@n() adds a parameter that allows manual control of the segmentation rules:

A
1 =file(“trades2.btx”).cursor(id,amount)
2 =A1.groupx@n(id;sum(amount);1000000)

The third parameter of groupx@n()represents how many serial numbers (i.e., grouped records) contained in each segment, which allows the programmers to control the number of segments according to data characteristics and memory capacity.

For groupings other than serial number grouping, it also makes sense to manually control the segmentation. SPL provides another option:

A
1 =file(“trades2.btx”).cursor(id,amount)
2 =A1.groupx@g(id;sum(amount);id\1000000)

When using the groupx@g(), the hash method for big grouping that can manually control the segmentation scheme will be adopted. The third parameter is an expression based on the current record, and the computation result is an integer to determine which segment the current record should be divided into. The data in each segment is not large in amount, which can be read into memory once again to perform hash grouping operation. In this example, the id does not have to be a continuous natural serial number, but it is still an integer, and can be used to compute the segment serial number.

The method for serial number can also be used to sort. If the field (or expression) for sorting is originally a natural number, just regard it as serial number and list it out without the need to compare again, and the complexity will be much lower than that of quick sorting.

For in-memory sorting, SPL provides the sort@n()function to fulfil the serial number sorting. Correspondingly, sortx@n() can also be used for big sorting, at this time, a method similar to hash grouping will be used for sorting, which works in a way that each time a batch of data is read, the data will be written into multiple buffer files by segment, then sequentially read data from each buffer file and perform serial number sorting.

A
1 =file(“trades2.btx”).cursor(id,amount)
2 =A1.sortx@n(id)
2 =A1.sortx@n(id;1000000)

sortx@n()also has a parameter similar to groupx@n() to control the segment size.

Similarly, there is also sortx@g(), which can be used to manually set the data segmentation scheme. The big sorting implemented by program cursor discussed in the previous section uses exactly this strategy, equivalent to the following code:

A
1 =file(“orders.btx”).cursor@b()
2 =A1.sortx@g(amount;int(amount\100))

Since the hash value and sorting field may be in different order, sorting cannot use hash value to segment as grouping does. Thus, a monotonical segmentation serial number should be calculated based on the sort field. Essentially, there is no hash sorting, but there will be a similar method like sortx@g().