Performance Optimization - 5.7 [Ordered traversal] Serial number grouping and controllable segmenting

 

Performance Optimization - 5.6 [Ordered traversal] Second-half ordered grouping

SPL provides an in-memory serial number grouping and aggregating function groups@n, and a same function for external storage cursor.

If the grouping key value can be easily calculated as serial number, then there is no need to calculate and compare hash values when grouping. Instead, you can directly use serial number to find the group (to do aggregation and accumulation), and the performance will be better.

Note that the calculation must be simple. If it is too complex (for example, you need to search a table to find the serial number), it’s better to calculate the hash value instead.

A
1 =file(“trades.ctx”).open().cursor(dt,amonut)
2 =A1.groups@n(month(dt);sum(amount))

In this way, the transaction amount can be grouped and aggregated efficiently by month. This method is suitable for things that can be easily calculated as serial number such as month or year. In addition, we can use the data type conversion scheme discussed in Chapter 2 to quickly calculate the serial number of year and month from the integerized date.

The groups@n() also supports multi-cursor.

The groups() will return small groups, can we still use serial number to improve performance for big grouping?

Of course, it is theoretically possible. Actually, this is a simplified hash method for big grouping. Since the grouping key itself is a hash value, it saves the time for computation and comparison, and the returned cursor will naturally be ordered by the grouping key.

A
1 =file(“trades2.btx”).cursor(id,amount)
2 =A1.groupx@n(id;sum(amount))

Assuming that the account id is consecutive natural number, this code can calculate the transaction amount of each account.

However, unlike the significant performance advantage of using serial number grouping over hash grouping for small grouping, serial number grouping is not significantly faster than hash grouping for big grouping. The reason is that the time for big grouping is mainly spent on writing and reading the buffer files, and the time difference between using serial number and hash value to locate the grouped records in memory can be neglected relative to the read and write time of buffer file.

Compared with conventional hash grouping, the significance of groupx@n() is that that it eliminates the possibility of ‘bad luck’ in hash grouping that may require a second round of hashing.

When performing hash grouping, knowing the range of hash values (which can be determined by hash function) can help to reasonably control the number of segments (let’s review the hash method for big grouping, segment the hash values first, and then write the corresponding groups into different buffer files). For serial number grouping, however, the maximum serial number is unknown, and the system may be conservative when it estimates the serial number by itself, resulting in too many segments, which will also affect performance. Therefore, group@n() adds a parameter that allows manual control of the segmentation rules:

A
1 =file(“trades2.btx”).cursor(id,amount)
2 =A1.groupx@n(id;sum(amount);1000000)

The third parameter of groupx@n()represents the number of serial numbers (i.e., grouped records) contained in each segment, which allows programmers to control the number of segments according to data characteristics and memory capacity.

For grouping not by serial number, manual control of segmentation is also meaningful. To this end, SPL provides another option:

A
1 =file(“trades2.btx”).cursor(id,amount)
2 =A1.groupx@g(id;sum(amount);id\1000000)

When using the groupx@g(), a hash-based big grouping algorithm that can manually control the segmentation scheme will be adopted. The third parameter is an expression based on the current record, and the calculation result is an integer to determine which segment the current record should be divided into. The data amount of each segment is not large, and can be read into memory again for further hash grouping. In this example, the id does not have to be consecutive natural serial number, but still an integer, which can be used to calculate segment’s serial number.

The serial number method can also be used for sorting. If the fields (or expression) for sorting are natural numbers, then we just need to list them in order without the need for comparison, and the complexity will be much lower than that of quick sorting.

For in-memory sorting, SPL provides the sort@n()function to fulfil serial number sorting. Correspondingly, sortx@n() can be used for big sorting. To be specific, use a method similar to hash grouping to sort first, and then write data into multiple buffer files by segment after reading a batch of data, and finally read the data of each buffer file in order and perform serial number sorting.

A
1 =file(“trades2.btx”).cursor(id,amount)
2 =A1.sortx@n(id)
2 =A1.sortx@n(id;1000000)

sortx@n()also has a parameter similar to groupx@n() to control the segment size.

Similarly, there is also sortx@g(), which can be used to manually set the data segmentation scheme. The big sorting implemented by program cursor discussed in the previous section uses exactly this strategy, equivalent to the following code:

A
1 =file(“orders.btx”).cursor@b()
2 =A1.sortx@g(amount;int(amount\100))

Since the hash value and sorting field may be in different order, sorting cannot use hash value to segment as grouping does. Thus, a monotonical segmentation serial number should be calculated based on the sort field. Essentially, there is no hash sorting, but there is a similar method like sortx@g().


Performance Optimization - 5.8 [Ordered traversal] Index sorting
Performance Optimization - Preface