What is the key to make the multi-tag user profile analysis run faster?

 

The analysis of user profile needs to use many tags to describe the user attributes. Usually, there are two types of tags, one of which may have multiple values, for example, the educational background tag of user includes the middle school, university, graduate, Ph.D., etc., and the age range tag includes the children, juvenile, youth, middle age and old age, we call this type of tag the enumeration tag. Another type of tag has only two values, yes and no. For this tag, we can use yes/or to describe the user, for example, use yes or no to describe whether the user is registered, active, white-collar worker, target user of a certain promotion program, etc. Such type of tag is called the binary tag.

In the user profile analysis scenarios, it often needs to perform a filtering calculation on the combined conditions in these two types of tags, for example: find out the users who meet the following conditions: middle-aged, university educational background, registered, active, targeted user during last year's Black Friday Sale.

When the total amount of data is huge, the operation performance depends primarily on the conditional filtering. Since these conditions are very arbitrary, and we can neither calculate in advance nor count on the index, an efficient hard traversal capability is required. In this case, the storage and calculation methods for the said two types of tags are critical.

In the relational database and data warehouse, the enumeration tag is just an ordinary field, and the corresponding filtering calculation is performed with IN operator in the WHERE clause, generally in the form of d IN (d1,…,dn), that is, it is true when the value of field d belongs to the value set {di,..}. The performance of IN calculation, however, is poor, chiefly owing to the fact that there are too many comparison operations during the calculation. To determine whether the field d belongs to the value set, if the sequential search is used, it needs to perform the comparison calculation between d and the members of value set by 1 to n times. Even when the values in the set are in order, and the binary search is used, it still needs to compare by many times. When the amount of data is huge, the number of comparisons will be very large, and the speed of judging IN will be very slow, and the larger the value set, the slower the speed.

The key to optimize the filtering performance of the enumeration tags is to eliminate the comparison operation. To do so, first determine the list of possible values of the IN field (the field before writing as IN condition). The number of possible values is usually not too large, so the list will not be too long. Next convert the original data, replacing the value of IN field with the sequence number (position) of the corresponding record in the list, and saving these sequence numbers as a new data.

When making an IN judgment on the new data after replacement, a boolean value set with the same length as the list needs to be generated first, and the ith value of the set is determined by judging whether the ith member of the list is contained in the value set of IN field, it is true if contained, otherwise it is false. When traversing, use the value of the IN field (the sequence number in the list) to take the member of boolean value set. If it is true, it meets the filtering condition, otherwise it does not meet.

This method essentially converts “comparison with set value” to “reference of sequence number”, which eliminates the comparison calculation and greatly improves performance. Moreover, the computation time is independent of the value set size and won’t increase with the increasing of number of enumeration values in IN condition.

It’s a pity that SQL does not support the method of directly accessing the set members by sequence number (position) in general, but use a transitional approach, i.e., using the associated table to access set members, which will cause more complex JOIN operations, and hence the said optimization method cannot be directly implemented.

Generally, the binary tags are stored in the database using boolean field. If there are only a few or dozens of tags, simply write the filtering condition in WHERE. However, there may be hundreds of tags in total, and the tables of many databases do not support so many fields. In this case, it needs to divide such table into multiple tables and then to perform JOIN operation. The performance of JOIN operation on large tables is very poor when the amount of data is huge.

In order to avoid the JOIN on large tables, thousands of boolean fields can be converted from columns to rows and stored in one “tag number” field. When calculating, group them first, then filter and count. However, the grouped result set is very large and needs the external storage to buffer the data, so the performance is still poor.

If the binary bits of an integer are used to store binary tags (0 and 1 each represent a value), then a 16-bit short integer can store 16 tags, and 100 integer fields can store 1600 tags, which can effectively reduce the number of fields and avoid the JOIN on large tables.

However, many databases do not yet support such bit-related computation, thus they cannot implement this performance optimization method.

The open-source data computing engine SPL supports the sequence number reference and bitwise operation, and can easily implement the above optimization method. The corresponding SPL code is also very simple. For example, the fields in the original data table T_ordinary include: user ID, enumeration tag field dName (such as age range: children, juvenile, youth, middle age, old age), 16 boolean flag fields flag1 to flag16, and the amount field amt. Among these fields, the value range of dName is in the option table dim. The following code can achieve the conversion between sequence number reference and bitwise storage:


A

1

=file("T_ordinary.ctx").open().cursor(id,dName,flag1,flag2,…,flag16,amt)

2

=T("dim.btx")

3

=A1.new(id,dim.pos@b(dName):d,bits(flag1,flag2,…,flag16):b,amt)

4

=file("T.ctx").create(id,d,b,amt)

5

=A4.append(A3)

A3 uses the pos function to replace the value of dName with the sequence number in dim and save as a new field d. The dim is ordered by dName in advance, and using the binary search here will be faster. Also, use the bits function to convert the 16 flag fields to one 16-bit integer field b.

After converting to table T, high-performance tag filtering and counting can be done. For example, when the filtering condition is that the values of dName is in the set ["middle age","old age"] passed in from the front end, and the flag flag4 and flag8 are 1, after filtering, we want to group by d and count the amount and the number of records, the code is roughly like this:


A

1

=bits(0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0)

2

=T("dim.btx").(["middle age","old age"].contain@b(~))

3

=file("T.ctx").open().cursor(amt;A2(d) && and(b,A1)==A1)

4

=A3.groups(d;sum(amt),count(~))

A1 uses the bits function to generate a 16-bit small integer, the 4th and 8th bits are 1, corresponding to flag4 and flag8; A2 generates a boolean value set; A3 uses the set and the small integer for filtering calculation.

After using SPL's pseudo table, these converted fields can also be transparent and use them directly like ordinary field. For example, after defining the pseudo table T_pseudo based on table T, the above code will roughly be changed to:


A

1

=T_pseudo.select(flag4 && flag8 && ["middle age","old age"].contain@b(dName))

2

=A3.groups(dName;sum(amt),count(~))

The flag4 and flag8 are the bit dimension fields defined in the pseudo table, corresponding to the 4th and 8th bits of the field b in the table T, while the dName is the enumeration dimension field in the pseudo table, and its value corresponds to the name of sequence number of field d in table T.

With the pseudo table, the actual storage and calculation methods remain unchanged, and SPL will automatically carry out the above algorithm. Moreover, ordinary boolean values can be used in the filtering condition, and the grouped values in the result set will also change to easy-to-read strings, and there is no need to convert between sequence number and name. For the specific usage of pseudo table, please visit: .