Performance Optimization - 8.4 [Multi-dimensional analysis] Dimension of boolean sequence

 

Performance Optimization - 8.3 [Multi-dimensional analysis] Redundant sorting

In the previous chapter, we used the aligned sequence to improve the association after the dimension table filtering. This technique can also be used to improve the slicing performance of the enumerated type dimension.

The so-called enumerated type dimension means that the value of the dimension is a limited number of values that have been determined in advance such as gender, place of origin; In multidimensional analysis, almost all dimensions are the enumerated type dimension except for the time dimension.

Generally, the condition for slicing (more precisely, dicing) of this type of dimension is to give a dimension value set, and then filter out all the records whose dimension value is in this set. Write it in code as follows:

T.select( V.contain( D ) )

where, T is the data table, D is the slice dimension, i.e., a field, and V is the dimension value list used as condition.

The calculation of contain() will not be fast. When V is not large, the sequential search will be used; when V is larger, the binary search will be used, but it still needs to compare several times to determine whether the data is in the slice. If the data type is complex, the performance will be worse.

We first convert the enumerated type dimension to integers (generally, the number of dimension value will not be too many, and most of them can be converted to small integers less than 65535), and then convert the slice condition to the aligned sequence composed of boolean values during query. In this way, the slice judgment result can be taken directly from the specified position of the sequence during comparison, avoiding complex contain() operation.

Convert dimension to integers:

A
1 =file(“T.ctx”).open()
2 =file("T_new.ctx).create(…)
3 =DV=A1.cursor(D).id(D)
4 =A1.cursor().run(D=DV.pos@b(D))
5 =A2.append@i(A4)

In A3, all possible values DV of dimension D will be calculated. Usually, such values are known in advance (such as gender, city, etc.), and there is no need to traverse the table. Why it is written this way here is to indicate its logical meaning. In A4, the DV will be employed to convert the dimension D to integers. The DV will be saved separately for use during query.

Code for slice aggregation:

A
1 =file(“T_new.ctx”).open()
2 =DV.(V.pos(~))
3 =A1.cursor(…;A2(D))
4 =A3.groups(…)

A2 converts the parameter V to a boolean sequence with the same length as DV. When a member of DV is in V, the member at the corresponding position in A2 will be non-null (playing the role of true when judging), otherwise it will be filled in as null (i.e., false). Then, when traversing to perform slicing, just use the dimension D that has been converted to integer as the sequence number to take the member of this boolean sequence. If it is non-null, it indicates that the original dimension D belongs to the slice condition list V. The operation complexity of taking the value by sequence number is far less than that of contain(), which greatly improves the slicing performance.

You can do the same for all enumerated type dimensions.


Performance Optimization - 8.5 [Multi-dimensional analysis] Flag bit dimension
Performance Optimization - Preface