How to Make Queries and Analyses on Data with Thousands of Tags Run Fast?
A tag is a value denoting true/false or yes/no assigned to a piece of information that helps describing and searching for data. We may use certain tags, such as white collar, active and registered, to describe customers. Sometimes we need to first select data with the specified tags whose values are “true” for further analysis, like counting customers whose “active” and “registered” tags are both true.
Databases usually store tags as boolean type fields. When there are several or dozens of tags, we can simply write the filtering conditions in WHERE clause. In real-world situations, there may be hundreds of thousands of tags. A database table is not allowed to have so many fields. The solution is to split the table into smaller ones and JOIN them as needed. But the join will be very slow when the tables are huge.
An alternative is to convert the thousands of boolean fields into rows under a single “tag_name” field and group the field for the filtering operation and more computations as needed. The grouping result set, however, is huge and requires hard disk buffer, leading to extremely poor performance.
If we use the binary number system to store tags (0 represents No and 1 represents Yes), a 16-bit small integer can store 16 tags and 100 small-integer-type fields can store 1600 tags. This can effectively reduce the data size and avoid joins between large tables.
Databases do not support the bit-based storage scheme that condenses one column into one bit and the relative computations. They cannot be used to achieve this optimization strategy.
esProc SPL supports more storage schemes and can express more algorithms, making it convenient to implement the strategy.
1. Preprocess data to convert to bit-based storage scheme.
The code is =cs.new(…,bits(flag1,flag2,…,flag16):f1). It converts the original 16 tag fields from flag1 to flag16 to a 16-bit integer by assigning values 0 and 1 to them using the bits function, that is, transforming the 16 fields to one field.
Suppose flag4=1 and flag7=1 and other flags are 0, the result f1 is a binary number 0001001000000000, which can be noted down as 4608 in decimal number system. Values of the 16 tags are all represented by the one integer alone.
2. Perform tag filtering and further analysis on the preprocessed data.
Suppose we are trying to get data where values of tags falg4 and flag8 are 1, we need to check whether the 4th bit and the 8th bit in the converted f1 field are 1. It is easy to know using the bitwise operation, which compares the binary number 0001000100000000 with f1 bit by bit and gets that the corresponding bits are all 1 if the result is still the binary number – that is, and(f1,4352)==4352, where 4352 is the integer in decimal number system converted from the binary number 0001000100000000.
We can code the filtering operation on tags as follows:
=file("T.ctx").open().cursor(id,…;and(f1,4352)==4352 && …)
In this way, only a simple bitwise operation is enough to find which records match the specified value for the tags without performing any JOINs between large tables, improving performance dramatically.
Tests show that, under the same hardware environment, SPL is 200 times faster than Spark SQL in achieving this strategy.
Find detailed implementation methods in SQL Performance Enhancement: Binary Flag