Performance Optimization - 8.3 [Multi-dimensional analysis] Redundant sorting


The aggregation operation without the slicing condition always involves all data. If the data is not pre-aggregated, there is no way to reduce the amount of calculation. However, when there are slicing conditions, if the data are organized reasonably, it may not be necessary to traverse all the data.

By simply creating an index on the dimension will have some effect, but just a little. Although using the index can quickly locate the records that meet the condition, if the physical storage location of these data is not continuous, there will still be a lot of waste while reading. When the target data are too scattered, using index may not be much better than full traversal. The reason is that in multidimensional analysis operation, even if slicing is performed, the amount of data to be read is still very large. The main application scenario of index is often to select a small amount of data.

If the data are stored orderly by a certain dimension, the slicing condition of this dimension can be used to limit the target data to a continuous storage area, in this way, there is no need to traverse all the data, and the amount of reading can be effectively reduced. However, each dimension may have a slicing condition in theory, if the data are sorted by each dimension, it is equivalent to being copied several times, as a result, the cost of such storage is somewhat high.

A compromise is to store two copies of data sets, that is, one copy is sorted by dimensions D1,…,Dn and stored, and the other copy is sorted by Dn,…,D1 and stored. In this way, although the amount of data will be doubled, it is still acceptable. For any dimension D, there will always be a data set that makes D in the first half of its sorted dimension list. If it is not the first dimension, the data after slicing will generally not be concatenated into a single area, but the data are also composed of some relatively large continuous areas. The closer to the top the position of dimension in the sorted dimension list, the higher the degree of physical order of data after slicing will be.

While calculating, it is enough to use one dimension’s slicing condition to filter, and the conditions of other dimensions are still used for traversal calculation. In multidimensional analysis, the slice on a certain dimension can often reduce the amount of data involved by several times or tens of times. It will be of little significance to reuse the slicing condition on other dimensions, and it is also very difficult to implement. When multiple dimensions have slicing conditions, we usually choose the dimension whose range is smaller than the total value range after slicing, which often means that the filtered amount of data is smaller.

The cgroups()function implements this selection. If there are multiple pre-aggregated data sorted by different dimensions and there are multiple slicing conditions, the most appropriate pre-aggregated data will be selected. When cuboid() creates the pre-aggregated data, the order of grouped dimensions is meaningful as different pre-aggregated data will be created for different dimension orders.

It is also possible to manually select a properly sorted data set with code, and store more sorted data sets.

The redundant sorting method is not only appropriate for multidimensional analysis, but also for the conventional traversal with filter condition. The reason for taking multidimensional analysis as an example is that the relevant features of this method will be more obvious while slicing the dimensions in multi-dimensional analysis, and it is more suitable to explain.