Performance Optimization - 8.5 [Multi-dimensional analysis] Flag bit dimension
A flag dimension, also known as a binary dimension, is an enumeration dimension with only two values, yes/no (or true/false), such as whether a person is married, has attended college, or holds a credit card. Flag dimensions are very common, and flagging customers or things is an important technique in current data analysis. Such a dimension is rarely used for grouping and aggregation; it usually serves as a slicing condition.
Data sets in modern multidimensional analysis often involve hundreds or even thousands of flag dimensions. Treating that many dimensions as ordinary fields wastes both storage and computation, making high performance difficult to achieve.
Because a flag dimension has only two values, a single bit is enough to store it. A 16-bit small integer can hold 16 flags, so information that originally required 16 fields fits in one. This storage method is called the flag bit dimension. It greatly reduces the storage volume and therefore the amount of data read from the hard disk, and using small integers does not slow down reading.
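As an illustration of the packing idea only (written in Python rather than SPL), the following minimal sketch stores 16 boolean flags in one 16-bit integer and reads them back. The function names `pack_flags` and `unpack_flags` are hypothetical, and placing the first flag in the highest bit is an assumption chosen to agree with the tag3/tag8 example later in this article.

```python
def pack_flags(flags):
    """Pack exactly 16 booleans into one integer, first flag in the highest bit."""
    if len(flags) != 16:
        raise ValueError("expected exactly 16 flags")
    word = 0
    for flag in flags:
        # Shift what we have so far and append the next flag as the lowest bit.
        word = (word << 1) | int(bool(flag))
    return word


def unpack_flags(word):
    """Recover the 16 booleans from a packed integer."""
    return [bool((word >> (15 - i)) & 1) for i in range(16)]


flags = [False] * 16
flags[2] = True   # third flag
flags[7] = True   # eighth flag

word = pack_flags(flags)
print(word)             # 8448
print(f"{word:016b}")   # 0010000100000000
```

One integer field now carries what would otherwise be 16 separate boolean fields.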
| | A |
|---|---|
| 1 | =file("T.ctx").open().cursor() |
| 2 | =file("T_new.ctx").create(…) |
| 3 | =n.("tag"/~).group((#-1)\16).("bits@b("/~.concat@c()/"):bits"/#) |
| 4 | =A1.new(…,${A3.concat@c()}) |
| 5 | =A2.append@i(A4) |
Let n be the number of flag dimensions, named tag1, tag2, and so on. A3 generates part of the parameters of the new() function, which converts these conventional boolean flag dimensions into bit-represented data. Every 16 tags form one small integer, named bits1, bits2, … in turn, so bits1 corresponds to the original tag1 through tag16, bits2 to tag17 through tag32, and so on. For simplicity, this code assumes that n is a multiple of 16. The … part of new() in A4 refers to the other fields, which are simply copied.
In practice, fields may be named and arranged differently, and the number of flag dimensions may not be a multiple of 16. You can adapt this example to write your own code.
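The grouping of tags into bits1, bits2, … can be sketched as follows, again in Python for illustration rather than in SPL. The function `pack_record` is a hypothetical name, and padding an incomplete last group with false values is one possible way (an assumption, not something the article prescribes) to handle a flag count that is not a multiple of 16.

```python
def pack_record(tags, width=16):
    """Turn n booleans (tag1..tagn) into a list of integers (bits1, bits2, ...).

    Every `width` consecutive tags form one integer; within a group the
    first tag occupies the highest bit.
    """
    words = []
    for start in range(0, len(tags), width):
        chunk = list(tags[start:start + width])
        # Assumption: pad an incomplete last group with False.
        chunk += [False] * (width - len(chunk))
        word = 0
        for flag in chunk:
            word = (word << 1) | int(bool(flag))
        words.append(word)
    return words


# 23 flags, with tag3, tag8 and tag23 set to true.
tags = [False] * 23
tags[2] = tags[7] = tags[22] = True
print(pack_record(tags))   # [8448, 512]
```

Here tag23 is the 7th flag of the second group, so it lands in bits2 with the value 512.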
Slicing usually involves conditions on several flag dimensions at once. With flag bit dimensions, selecting the records whose chosen flag dimensions are all true means selecting the records in which the corresponding flag bits are all 1 at the same time. This test can also be done with a bit operation.
| | A |
|---|---|
| 1 | =file("T_new.ctx").open() |
| 2 | =tags.split@c().(int(mid(~,4))-1).group@n(~\16+1) |
| 3 | =A2.(~.sum(shift(1,~%16-15))) |
| 4 | =A3.pselect@a(~>0) |
| 5 | =A4.("and(bits"/~/","/A3(~)/")=="/A3(~)) |
| 6 | =A1.cursor(…;${A5.concat("&&")}) |
| 7 | =A6.groups(…) |
The condition tags is a comma-separated string of flag field names. For example, "tag3,tag8,tag23" selects the data in which these three flag dimensions are all true. A2 parses out the sequence numbers and groups them, with each group corresponding to one new flag bit dimension. A3 calculates the value that the flag bit dimension of each group should have. For example, tag3 and tag8 fall into the same group, which corresponds to bits1, and the value calculated in A3 is the binary number 0010000100000000 (the third and eighth bits are 1). A4 finds the positions of the values in A3 that are not 0, which are the sequence numbers of the bit dimensions that need to be tested (those whose value is 0 need no test). A5 converts these values into conditions on the flag bit dimensions. For example, the condition for the group containing tag3 and tag8 is:
and(bits1,8448)==8448
(8448 is the decimal value of the binary number just mentioned)
The and() function performs a bitwise AND, which here tests whether the 3rd and 8th bits of bits1 are both 1, that is, whether the original tag3 and tag8 are both true. One operation therefore evaluates two flag conditions at once.
The flag bit dimension effectively reduces the amount of storage. In most cases it also merges the tests of several tags into a single test on one bit dimension, which significantly improves performance. The method is equally worthwhile for in-memory calculation, where it greatly reduces memory usage and, to some extent, the number of comparisons.
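The mask construction and the bitwise test described above can be sketched in Python for illustration. The names `build_masks` and `matches` are hypothetical, and representing a record as a dictionary keyed by bits1, bits2, … is an assumption. The sketch combines bits with OR, which gives the same result as the sum in the SPL code as long as each tag appears only once in the condition.

```python
def build_masks(tags, width=16):
    """Map a condition such as "tag3,tag8,tag23" to {bits field number: mask}."""
    masks = {}
    for name in tags.split(","):
        index = int(name.strip()[3:]) - 1         # 0-based tag number
        field = index // width + 1                # 1 -> bits1, 2 -> bits2, ...
        bit = 1 << (width - 1 - index % width)    # first tag of a group is the highest bit
        masks[field] = masks.get(field, 0) | bit
    return masks


def matches(record, masks):
    """True if every required bit is 1 in the record's bits fields."""
    return all((record[f"bits{field}"] & mask) == mask
               for field, mask in masks.items())


masks = build_masks("tag3,tag8,tag23")
print(masks)   # {1: 8448, 2: 512}

print(matches({"bits1": 8448 | 1, "bits2": 512}, masks))   # True
print(matches({"bits1": 8192, "bits2": 512}, masks))       # False (tag8 is 0)
```

Only the fields that appear in `masks` are tested, which mirrors the step of skipping bit dimensions whose value is 0.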