Performance Optimization - 8.5 [Multi-dimensional analysis] Flag bit dimension

 

A flag dimension, also called a binary dimension, is an enumerated dimension with only two values, yes/no (or true/false), such as whether a person is married, has attended college, or holds a credit card. Flag dimensions are very common, and flagging customers or things is an important technique in current data analysis. Such a dimension is rarely used for grouping and aggregation; it is usually used as a slicing condition.

Data sets in modern multidimensional analysis often have hundreds of flag dimensions. If so many dimensions are handled as ordinary fields, a lot of storage and computation is wasted, and high performance is hard to achieve.

A flag dimension has only two values, so one bit is enough to store it. A 16-bit small integer can hold 16 flags, so a single field can store what originally required 16 fields. This storage scheme is called the flag bit dimension. It greatly reduces the storage size, and hence the amount of data read from the hard disk, and using a small integer does not slow down reading.
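The packing idea can be sketched outside SPL. The following Python snippet is only an illustration: the function names `pack_flags` and `unpack_flags` are hypothetical, and the choice of putting the first flag in the lowest bit is an assumption made to match the bit layout described later in this section.

```python
def pack_flags(flags):
    """Pack up to 16 booleans into one integer; flags[0] goes to the lowest bit."""
    if len(flags) > 16:
        raise ValueError("a 16-bit field holds at most 16 flags")
    value = 0
    for position, flag in enumerate(flags):
        if flag:
            value |= 1 << position
    return value


def unpack_flags(value, count=16):
    """Recover the list of booleans from a packed integer."""
    return [bool((value >> position) & 1) for position in range(count)]


# Sixteen yes/no fields of one record (married, attended college, ...).
flags = [False] * 16
flags[2] = True   # third flag
flags[7] = True   # eighth flag
packed = pack_flags(flags)
print(packed, format(packed, "016b"))  # 132 0000000010000100
```

Sixteen separate boolean fields collapse into the single value 132 here, and `unpack_flags(packed)` returns the original list.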

A
1 =file("T.ctx").open().cursor()
2 =file("T_new.ctx").create(…)
3 =(n\16).("bits("/to(~*16,~*16-15).("if(tag"/~/",1,0)").concat@c()/"):bits"/~)
4 =A1.new(…,${A3.concat@c()})
5 =A2.append@i(A4)

Let n be the number of flag dimensions, named tag1, tag2, … A3 generates part of the parameters of the new() function, which convert the conventional boolean flag dimensions into their bit representation. Every 16 tags form one small integer, and these integers are named bits1, bits2, … Thus bits1 corresponds to the original tag1 through tag16, bits2 to tag17 through tag32, and so on. For simplicity, the code assumes that n is a multiple of 16. The … part of new() in A4 stands for the other fields, which are simply copied.

In practice, the fields may be named and arranged differently, and the number of flag dimensions may not be a multiple of 16, but this example can be adapted to such cases.
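As a rough illustration of the conversion when the number of flags is not a multiple of 16, here is a Python sketch (not SPL). The function name `to_bit_fields`, the sample record and its `id` and `amount` fields are all hypothetical. The last bits field is simply left partly unused.

```python
def to_bit_fields(record, n, width=16):
    """Replace tag1..tagn in a record with bits1, bits2, ... of `width` flags each."""
    tag_names = {f"tag{i}" for i in range(1, n + 1)}
    # Copy the non-flag fields unchanged.
    converted = {name: value for name, value in record.items()
                 if name not in tag_names}
    for i in range(1, n + 1):
        field = f"bits{(i - 1) // width + 1}"   # tag1..tag16 -> bits1, tag17.. -> bits2
        converted.setdefault(field, 0)
        if record.get(f"tag{i}"):
            converted[field] |= 1 << ((i - 1) % width)
    return converted


# A record with 20 flags (not a multiple of 16); tag3, tag8 and tag19 are true.
record = {"id": 7, "amount": 120.5}
record.update({f"tag{i}": i in (3, 8, 19) for i in range(1, 21)})

print(to_bit_fields(record, 20))
# {'id': 7, 'amount': 120.5, 'bits1': 132, 'bits2': 4}
```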

Slicing usually involves conditions on several flag dimensions at once, in which case the records whose values for all of these flag dimensions are true are selected. From the perspective of the flag bit dimension, this means selecting the records in which the corresponding flag bits are all 1. This test can also be performed with a bit operation.

A
1 =file("T_new.ctx").open()
2 =tags.split@c().(int(mid(~,4))-1).group(~\16)
3 =A2.(~.sum(shift(1,-(~%16))))
4 =A3.pselect@a(~!=0)
5 =A4.("and(bits"/~/","/A3(~)/")=="/A3(~))
6 =A1.cursor(…;${A5.concat("&&")})
7 =A6.groups(…)

The condition tags is a comma-separated string of flag field names, such as “tag3,tag8,tag23”, which selects the data for which all three flag dimensions are true. A2 parses out the sequence numbers and groups them so that each group corresponds to one flag bit dimension. A3 calculates the value of the flag bit dimension for each group. For example, tag3 and tag8 fall into the same group, corresponding to bits1, and the value calculated in A3 is the binary number 0000000010000100 (the lowest bit is on the right; the third and eighth bits counted from the right are 1). A4 takes the sequence numbers of the values that are not 0, i.e., the bit dimensions that need to be tested (a value of 0 needs no test). A5 then converts these values into conditions on the flag bit dimensions. For example, the condition for the group containing tag3 and tag8 is

and(bits1,132)==132

(132 is the decimal value of the binary number just mentioned)
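The mask arithmetic can be checked with a short Python sketch (an illustration, not the SPL code). The name `build_masks` is hypothetical, and the sketch assumes field names of the form tagN with 16 flags per bits field.

```python
def build_masks(tags, width=16):
    """Map each bits field number to the mask of the requested flags."""
    masks = {}
    for name in tags.split(","):
        index = int(name.strip()[3:]) - 1      # "tag8" -> 7
        group = index // width + 1             # 1 -> bits1, 2 -> bits2, ...
        masks[group] = masks.get(group, 0) | (1 << (index % width))
    return masks


masks = build_masks("tag3,tag8,tag23")
conditions = [f"and(bits{g},{m})=={m}" for g, m in sorted(masks.items())]
print(" && ".join(conditions))
# and(bits1,132)==132 && and(bits2,64)==64
```

Here tag23 is the seventh flag of bits2, so its mask is 64.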

The and() function performs a bitwise AND, which tests whether the 3rd and 8th bits of bits1 (counting from the right) are both 1, that is, whether the original flags tag3 and tag8 are both true. A single operation therefore tests two flag conditions at once.
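A minimal Python sketch of the same test applied to a few records follows. The rows and the helper name `matches` are made up for illustration. The mask 132 comes from the example above, and 64 is the assumed mask for tag23 in bits2.

```python
def matches(row, masks):
    """True when every requested flag bit is set in the row's bits fields."""
    return all((row[f"bits{g}"] & m) == m for g, m in masks.items())


# Masks for tag3 and tag8 (bits1) and tag23 (bits2).
masks = {1: 132, 2: 64}

rows = [
    {"id": 1, "bits1": 132, "bits2": 64},   # exactly the requested flags
    {"id": 2, "bits1": 4,   "bits2": 64},   # tag8 is missing
    {"id": 3, "bits1": 255, "bits2": 65},   # extra flags set, still a match
]

selected = [row["id"] for row in rows if matches(row, masks)]
print(selected)  # [1, 3]
```

Comparing the AND result with the mask itself, rather than with zero, is what requires all of the requested bits to be set instead of just one of them.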

The flag bit dimension effectively reduces the storage size, and in most cases the tests for multiple tags can be combined into a test on one bit dimension, which significantly improves performance. The method is also valuable for in-memory calculation, since it greatly reduces memory usage and, to some extent, the number of tests.