Performance Optimization - 8.1 [Multi-dimensional analysis] Partial pre-aggregation


Essentially, the backend operation of multi-dimensional analysis is grouping and aggregation, and the grouping methods described earlier can all be used. However, when the amount of data is very large, it is difficult to achieve instant response.

To solve this problem, an easy solution that comes to mind is pre-aggregation, that is, calculating the aggregation results in advance so they can be returned directly when the frontend requests them. In other words, this method trades space for time, converting a traversal problem into a search problem, and in theory it can achieve instant response.

However, full pre-aggregation is basically infeasible, as a simple calculation shows.

Full pre-aggregation over 50 dimensions requires 2^50 intermediate result sets, and a conservative estimate of the required capacity exceeds one million terabytes, so it is not operable. Even if only 10 dimensions are pre-aggregated (10 is not a large number, since we need to pre-aggregate both the observation and slice dimensions), it still requires at least hundreds of terabytes of storage space, so the practicality is very poor.
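The combination count is easy to verify: each of the n dimensions either appears in a grouping combination or not, so full pre-aggregation needs 2^n cuboids. A quick check (the storage figures in the text remain rough estimates that depend on dimension cardinalities):

```python
# Number of dimension subsets (cuboids) needed for full pre-aggregation:
# each of the n dimensions is either in a grouping combination or not.
def cuboid_count(n_dims: int) -> int:
    return 2 ** n_dims

print(cuboid_count(50))  # 1125899906842624, about 1.1e15 result sets
print(cuboid_count(10))  # 1024
```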

What we can do is partial pre-aggregation, that is, pre-aggregate only some combinations of dimensions. When the frontend makes a request, first search for pre-aggregated data that satisfies the condition, and then aggregate further on it. In this way, performance can often be improved by dozens of times on average, which is frequently enough to meet the requirements.
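The "aggregate further" step works because an aggregate like SUM is composable: summing partial sums gives the total. A minimal Python sketch with hypothetical data, rolling a pre-aggregated (year, region) sum table up to totals by year without touching the raw detail rows:

```python
from collections import defaultdict

# Hypothetical pre-aggregated data: sums already grouped by (year, region).
pre_agg = {
    (2023, "east"): 120, (2023, "west"): 80,
    (2024, "east"): 150, (2024, "west"): 90,
}

def roll_up(pre_agg, keep):
    """Aggregate pre-aggregated sums further, keeping only some dimensions.

    `keep` lists positions in the original grouping key. This is valid
    because a sum of partial sums equals the total sum.
    """
    out = defaultdict(int)
    for key, value in pre_agg.items():
        out[tuple(key[i] for i in keep)] += value
    return dict(out)

print(roll_up(pre_agg, (0,)))  # totals by year: {(2023,): 200, (2024,): 240}
```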

Which dimension combinations to pre-aggregate generally depends on practical experience. Engineering means can also be used to decide: for example, you can record historical query requests and analyze them statistically, or dynamically generate new combinations and delete infrequently used ones. Algorithmically, there is not much to discuss.

If there are multiple sets of pre-aggregated data, which one should be chosen to respond to a frontend request?

Suppose the frontend requests statistical values for dimensions A and B; we need to search for pre-aggregated data that includes both A and B. If multiple sets of pre-aggregated data meet the criterion, simply select the one with the smallest amount of data and calculate on it.

These logics are relatively simple.
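The selection rule can be sketched in a few lines of Python (with a hypothetical catalog of cuboids; the dimension sets and row counts are made up for illustration):

```python
# Hypothetical catalog: each cuboid records the dimensions it is grouped by
# and its row count. A cuboid can serve a query only if its dimension set
# contains every requested dimension.
cuboids = [
    {"name": "c1", "dims": {"A", "B", "C"}, "rows": 100_000},
    {"name": "c2", "dims": {"A", "B"},      "rows": 5_000},
    {"name": "c3", "dims": {"B", "D"},      "rows": 8_000},
]

def pick_cuboid(cuboids, requested):
    """Among cuboids covering all requested dimensions, pick the smallest."""
    candidates = [c for c in cuboids if set(requested) <= c["dims"]]
    return min(candidates, key=lambda c: c["rows"]) if candidates else None

print(pick_cuboid(cuboids, {"A", "B"})["name"])  # c2: smallest covering set
```

When no cuboid covers the requested dimensions, the sketch returns None, corresponding to falling back to the original detail data.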

SPL provides partial pre-aggregation functions for the composite table:

A
1 =file("T.ctx").open()
2 =A1.cuboid(Cube1,D1,…;sum(M1),…)
3 =A1.cuboid(Cube2,D1,…;sum(M1),…)

The cuboid() function creates the pre-aggregated data. You need to give it a name, and the remaining parameters are the same as those of the grouping and aggregation operations.

It is also very simple to use:

A
1 =file("T.ctx").open()
2 =A1.cgroups(D1,…;sum(M1),…)

Before calculation, the cgroups() function automatically searches for the most appropriate pre-aggregated data according to the logic described above.

The pre-aggregation scheme is quite simple, but it is limited by capacity and has many application limitations, so it can only handle the most common situations.

It is difficult to fully apply the pre-aggregation scheme in the following situations:

1) Unconventional aggregations: besides the common sum and count operations, unconventional aggregations such as count distinct, median and variance are likely to be omitted, and they cannot be derived from other aggregate values. In theory there are countless kinds of aggregation operations, and it is impossible to pre-aggregate them all.

2) Aggregation combinations: aggregation operations may be combined. For example, we may want the average monthly sales, which is calculated by adding up the daily sales within each month and then averaging over the months. This is not a simple sum or average, but a combination of two aggregations at different dimension levels. Such operations are also unlikely to be pre-aggregated in advance.

3) Conditions on metrics: the metrics may also carry conditions in the statistics. For instance, we may want the total sales of orders whose transaction amount is greater than 100 yuan. This cannot be handled during pre-aggregation, because 100 is a parameter entered temporarily at query time.

4) Time period statistics: time is a special dimension that can be either enumerated or sliced as a continuous interval. The starting and ending points of the query interval may be fine-grained (for example, down to a certain day); in this case the fine-grained data must be used for recounting, rather than directly using the higher-level pre-aggregated data.
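The first limitation can be demonstrated concretely (a Python sketch with made-up data): partial sums can be added together, but partial count-distinct values cannot, because customers appearing in several groups would be double-counted. A distinct count therefore has to be computed from finer-grained data:

```python
# Made-up detail data: customer ids seen in each region.
east = ["u1", "u2", "u3"]
west = ["u2", "u3", "u4"]

# SUM is composable: a total of partial sums equals the global sum.
assert sum([sum([1, 2]), sum([3, 4])]) == sum([1, 2, 3, 4])

# COUNT DISTINCT is not: adding per-region distinct counts double-counts
# customers active in both regions.
per_region = len(set(east)) + len(set(west))  # 3 + 3 = 6
overall = len(set(east) | set(west))          # {u1, u2, u3, u4} -> 4
print(per_region, overall)  # 6 4: the composed figure overstates the truth
```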