When will the pre-calculation of customer profile analysis be over?
Customer profiling is very popular in business analysis nowadays. Simply put, it means attaching various tags to customers, using these tags to define different customer groups (the so-called profiles), and then counting the number (and changes) of customers in each group.
Logically speaking, a tag is a dimension, i.e., a field of the data table. Tags are fields with relatively simple values, and generally fall into two types: binary tags, which have only two values and are usually represented by 0/1, such as marital status or gender; and enumeration tags, whose values range from a few to a few hundred and can be represented as small integers, such as education level or age range. The two types can also be converted into each other.
Profile statistics is nothing special: it is a standard multidimensional analysis task that filters (WHERE) on the tag conditions defining the customer group (profile) and then aggregates (mainly COUNT), sometimes with grouping.
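To make this concrete, here is a minimal Java sketch of the idea; the Customer record, its tag fields, and the sample values are all hypothetical. A profile standard is just a condition over tag fields, and the statistic is the count of customers that satisfy it.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical, simplified model: a customer row with a few tag fields.
record Customer(long id, int maritalStatus, int gender, int education, int ageRange) {}

public class ProfileCount {
    // A profile standard is a predicate over the tag fields;
    // the statistic is the count of customers that satisfy it.
    static long count(List<Customer> customers, Predicate<Customer> profile) {
        return customers.stream().filter(profile).count();
    }

    public static void main(String[] args) {
        List<Customer> customers = List.of(
                new Customer(1, 1, 0, 3, 2),
                new Customer(2, 0, 1, 5, 4),
                new Customer(3, 1, 1, 3, 4));

        // Equivalent of: WHERE maritalStatus = 1 AND education IN (3, 5)
        Predicate<Customer> profile =
                c -> c.maritalStatus() == 1 && (c.education() == 3 || c.education() == 5);

        System.out.println(count(customers, profile));   // prints 2
    }
}
```

A database does exactly this with WHERE and COUNT; the trouble described next only begins when the data volume and the number of profiles grow large.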
However, when the data volume is particularly large (a huge number of customers is the norm; otherwise there would be no need for analysis), the statistics can become very slow and often cannot be calculated in real time. The various profile standards (i.e., the corresponding tag conditions) then have to be collected in advance and pre-calculated before business staff can query them.
In practice, business managers may come up with a profile standard at any time, run the statistics, and, when the result looks off, immediately adjust the standard and run them again. Such exploratory analysis is the only way to obtain meaningful results. But once pre-calculation becomes necessary, all of this loses its meaning.
So why can't it be calculated fast? After all, the industry should have mastered such a simple calculation long ago.
There are three main reasons for this.
First, the number of tags is very large, especially binary tags, which may number in the hundreds or even thousands and keep growing. This often exceeds the maximum number of fields allowed in a database table (usually only 256 or 512), so the tags cannot simply be designed as table fields, and alternative measures have to be taken.
There are generally two methods. One is to design multiple tables, each with customer ID as the primary key and storing a portion of the tags, and JOIN them during the calculation. The other is to convert columns into rows, number the tags, and design the data structure as {customer ID, tag number, tag value}, which then involves GROUP BY, HAVING, or COUNT(DISTINCT) in the calculation.
Neither method performs well. JOIN has always been a weak point of databases, and so have GROUP BY and COUNT(DISTINCT). Faced with a huge number of customer IDs, the database's algorithms require a large amount of memory, which can easily lead to crashes, or fall back on buffering and multiple traversals, causing performance to drop sharply. Yet another approach is to pack all tag values into a string, such as “Tag 1=Value 1; Tag 2=Value 2; …”. This is flexible, but reading and parsing strings is very time-consuming, and the computational performance is often even worse than JOIN and GROUP BY; its only merit is that it generally does not crash.
Second, the filtering condition on enumeration tags is usually an IN operation. Without optimization, the database has to make N comparisons per IN evaluation (N being the size of the IN set), a complexity of O(N). Even with binary search after sorting, it still takes log N comparisons, a complexity of O(log N). Moreover, binary search carries some fixed overhead of its own; although the complexity decreases, the gain on a small IN set (which is common in profile analysis) is not significant.
Third, and most crucial: profile analysis often targets multiple profile standards at the same time. Each profile standard corresponds to different filtering conditions (WHERE) and grouping rules (GROUP BY), and in the database each one requires its own SQL statement, which traverses the data table once to compute its result. With multiple profile standards there are multiple SQL statements, and the data table is traversed multiple times. A typical profile analysis page may display dozens or even hundreds of profile indexes at once, which means traversing a huge data table dozens or even hundreds of times; this cannot be done in real time and can only be pre-calculated.
With good optimization in all three aspects, profile analysis could run much faster. Unfortunately, most relational databases fail on all of them. The first factor is an engineering issue and should be solvable, but the closed architecture of a database requires unified metadata management, and too many fields make the metadata exceptionally complex, so most databases simply refuse to support it. The second factor is half theoretical and half engineering: SQL can only describe the filtering condition on an enumeration tag with IN, so the database optimizer has to recognize it and adopt a scheme that avoids set-membership comparison, which is something most databases have never considered. The third factor is theoretical: a single SQL statement cannot return multiple summary results, and the optimizer cannot assume any correlation between separate SQL statements, so effectively every database is forced to traverse the data multiple times. As mentioned earlier, this is the most critical factor.
esProc SPL can handle all three.
Strictly speaking, esProc SPL is not a database but a professional computing engine. It provides its own columnar-storage file format and, on top of it, computing power that does not rely on a database and can completely replace the computing capabilities of a data warehouse. Unlike traditional databases, esProc SPL does not adopt relational algebra and SQL; instead, it is built on the theory of discrete datasets and a newly invented programming language, SPL, which can express much richer optimization logic without being constrained by SQL.
The esProc SPL columnar-storage file supports far more columns (up to thousands), and there is no database-wide metadata, so a large number of columns in any one file (table) creates no overall management burden; this naturally resolves the first factor. Even when there are so many tags that the table still has to be split, esProc SPL provides a high-performance ordered merge algorithm for this case: the cost of the multi-table association is very small, and the speed is close to that of a single-table traversal, depending mainly on the number of tags involved.
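To illustrate why an ordered merge is so cheap, here is a minimal Java sketch under assumed conditions: two split tag tables (the hypothetical TagPart1/TagPart2 records) are both sorted by customer ID, so they can be associated in a single forward pass over each side, with no hashing or repeated lookups.

```java
import java.util.List;

// Hypothetical split storage: both parts are sorted by customer id,
// so they can be associated with one forward pass over each (ordered merge)
// instead of a general-purpose hash or nested-loop JOIN.
public class OrderedMergeSketch {
    record TagPart1(long id, int tagA, int tagB) {}
    record TagPart2(long id, int tagC) {}

    // Count customers with tagA == 1 (from part 1) and tagC == 2 (from part 2).
    static long mergeCount(List<TagPart1> p1, List<TagPart2> p2) {
        long count = 0;
        int i = 0, j = 0;
        while (i < p1.size() && j < p2.size()) {
            long id1 = p1.get(i).id(), id2 = p2.get(j).id();
            if (id1 < id2) i++;                 // advance whichever side is behind
            else if (id1 > id2) j++;
            else {                              // same customer: evaluate the profile condition
                if (p1.get(i).tagA() == 1 && p2.get(j).tagC() == 2) count++;
                i++; j++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<TagPart1> p1 = List.of(new TagPart1(1, 1, 0), new TagPart1(2, 0, 1), new TagPart1(3, 1, 1));
        List<TagPart2> p2 = List.of(new TagPart2(1, 2), new TagPart2(3, 2));
        System.out.println(mergeCount(p1, p2));  // prints 2 (customers 1 and 3)
    }
}
```

Because both sides only advance monotonically, the cost stays close to a single traversal of the data, which is the point made above.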
Specifically, for binary tags esProc SPL supports bitwise storage, which stores each tag in a single binary bit and packs multiple tags into a 16-, 32-, or 64-bit integer. This greatly reduces the number of columns and the storage size (by a factor of 16 to 64 compared with conventional storage). When a profile standard involves many binary tags, it also significantly reduces the amount of data read and computed: a single 16-bit integer holds many of the binary tags referenced by the conditions and only needs to be read and processed once ( Performance Optimization - 8.5 [Multi-dimensional analysis] Flag bit dimension ).
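The principle of bitwise storage can be pictured with the following plain-Java sketch; the tag names and bit positions are hypothetical, and this is only an illustration of the idea, not esProc's actual file format.

```java
// Illustrative sketch of the bit-packing idea (not esProc's actual storage format):
// binary tags of one customer are packed into a single integer, and a profile
// condition over several binary tags becomes one mask comparison per customer.
public class BitTagSketch {
    // Hypothetical tag positions within the packed integer.
    static final int MARRIED   = 1 << 0;
    static final int HAS_CAR   = 1 << 1;
    static final int HAS_HOUSE = 1 << 2;

    // Condition "married AND owns a house" = all bits of the mask are set.
    static boolean matches(int packedTags, int mask) {
        return (packedTags & mask) == mask;
    }

    public static void main(String[] args) {
        int[] customers = {
                MARRIED | HAS_HOUSE,            // matches
                MARRIED | HAS_CAR,              // does not match
                MARRIED | HAS_CAR | HAS_HOUSE   // matches
        };
        int mask = MARRIED | HAS_HOUSE;

        long count = 0;
        for (int packed : customers)
            if (matches(packed, mask)) count++; // one AND + one compare per customer
        System.out.println(count);              // prints 2
    }
}
```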
Some databases now support bit operations, but writing them in SQL is still quite cumbersome. esProc SPL provides a pseudo table object that makes the bit-packing of binary tags transparent: programmers keep operating on individual tag fields, which SPL maps to particular bits of the packed integer under the hood ( SPL Pseudo Table Data Type Optimization ).
For the conditions on enumeration tags, esProc SPL provides the boolean dimension sequence technique, which converts the IN computation into array accesses, reducing the complexity directly to O(1); that is, the computation time no longer depends on the size of the IN set ( Performance Optimization - 8.4 [Multi-dimensional analysis] Boolean dimension sequence ).
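The underlying idea can be illustrated in plain Java as follows (the domain size, IN set, and field values are hypothetical): the IN set is converted once into a boolean array indexed by the enumeration code, so each row costs a single array access instead of up to N comparisons.

```java
// Sketch of replacing IN with array access: enumeration tag values are small integer
// codes, so the IN set {3, 5, 7} can be pre-converted into a boolean array indexed
// by code. Each row then needs one array access instead of N comparisons.
public class BooleanDimensionSketch {
    public static void main(String[] args) {
        int maxCode = 10;                       // hypothetical size of the enumeration domain
        int[] inSet = {3, 5, 7};                // the original IN subset

        // Build the boolean "dimension sequence" once per query.
        boolean[] selected = new boolean[maxCode + 1];
        for (int code : inSet) selected[code] = true;

        int[] educationOfCustomers = {3, 4, 5, 9, 7, 3};

        long count = 0;
        for (int code : educationOfCustomers)
            if (selected[code]) count++;        // O(1) per row, independent of the IN set size
        System.out.println(count);              // prints 4
    }
}
```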
For multi-index statistics, esProc SPL offers the multi-purpose traversal syntax, which computes multiple statistical values in a single traversal ( Performance Optimization - 4.2 [Traversal technology] Multipurpose traversal ). This is the key to solving the performance problem of profile analysis.
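The principle is easy to picture in plain Java, as in the sketch below; the profile conditions and data are hypothetical, and esProc SPL expresses this with its own cursor/channel syntax rather than this code. The point is that all profile counters are advanced during one pass over the data, so adding more indexes barely increases the traversal cost.

```java
import java.util.List;
import java.util.function.Predicate;

// Sketch of the multi-purpose traversal idea in plain Java: instead of one scan per
// profile index (one SQL statement each), all counters are advanced during a single
// pass over the data. This is only an illustration of the principle.
public class MultiPurposeTraversalSketch {
    record Customer(int maritalStatus, int education, int ageRange) {}

    public static void main(String[] args) {
        List<Customer> customers = List.of(
                new Customer(1, 3, 2), new Customer(0, 5, 4), new Customer(1, 3, 4));

        // Several profile standards (hypothetical), each a different condition.
        List<Predicate<Customer>> profiles = List.of(
                c -> c.maritalStatus() == 1,
                c -> c.education() == 3 || c.education() == 5,
                c -> c.maritalStatus() == 1 && c.ageRange() >= 4);

        long[] counts = new long[profiles.size()];
        for (Customer c : customers)                     // the table is traversed only once
            for (int k = 0; k < profiles.size(); k++)
                if (profiles.get(k).test(c)) counts[k]++;

        for (long n : counts) System.out.println(n);     // prints 2, 3, 1
    }
}
```

This is also why, in the benchmark below, the esProc SPL figures grow only slightly as the number of indexes increases.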
Here is a multi-index calculation test on a wide table generated from TPC-H 100G data ( SPL computing performance test series: multi-index aggregating ):
| | 4C16G, 1 index | 4C16G, 2 indexes | 4C16G, 3 indexes | 8C32G, 1 index | 8C32G, 2 indexes | 8C32G, 3 indexes |
|---|---|---|---|---|---|---|
| ClickHouse wide table | 77.4 | 156.0 | 249.6 | 34.7 | 69.0 | 106.4 |
| StarRocks wide table | 135.7 | 253.6 | 402.6 | 62.2 | 104.6 | 156.2 |
| esProc SPL wide table | 114.2 | 119.5 | 124.1 | 57.7 | 61.6 | 64.6 |
| esProc SPL Join | 100.5 | | | 49.5 | | |
As the results show, when only one index is calculated, esProc SPL cannot keep up with ClickHouse. But once multiple indexes are calculated it pulls ahead, and the more indexes there are, the more obvious the advantage. Faced with the dozens or hundreds of indexes of a profile analysis, the multi-purpose traversal mechanism of esProc SPL gives it an overwhelming advantage.
In practice, the performance of esProc SPL is also excellent. Here are two cases where SPL was used to deal with pre-calculation:
Open-source SPL speeds up intersection calculation of customer groups in bank user profile by 200+ times : using the aforementioned bitwise storage for binary tags and boolean dimension sequences for the enumeration tag conditions, it delivers more than 200 times the performance of the MPP database.
Open-source SPL optimizes bank pre-calculated fixed query to real-time flexible query : by further applying multi-purpose traversal, nearly 200 indexes on the same page are calculated in real time, pre-calculation is abandoned completely, and the original business model is transformed.
esProc SPL also has a number of engineering advantages:
esProc is pure Java software that runs in any JVM environment and can be seamlessly embedded in Java programs, bringing data-warehouse-level computing power to applications of all kinds in a very lightweight manner (a minimal JDBC embedding sketch is shown below).
esProc provides a visual development environment that supports step-by-step execution, breakpoints, and WYSIWYG result preview, making development and debugging far more convenient than with SQL and stored procedures.
SPL also has complete flow-control statements, such as for loops and if branches, and supports subroutine calls. It offers the procedural capability that otherwise only stored procedures provide, and can fully replace SQL and stored procedures.
…
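For the embedding scenario mentioned above, a minimal sketch of calling esProc from a Java application through its JDBC driver might look like the following; the driver class and URL follow the esProc documentation as I understand it and should be verified against the version in use, and the SPL script name profileCount is purely hypothetical.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

// Minimal sketch of embedding esProc in a Java application via its JDBC driver.
public class EsprocJdbcSketch {
    public static void main(String[] args) throws Exception {
        // Driver class and URL as given in the esProc JDBC documentation (verify against
        // the deployed version); the script name profileCount is hypothetical.
        Class.forName("com.esproc.jdbc.InternalDriver");
        try (Connection con = DriverManager.getConnection("jdbc:esproc:local://");
             // Call a pre-written SPL script in the same way as a stored procedure.
             CallableStatement st = con.prepareCall("call profileCount()")) {
            ResultSet rs = st.executeQuery();
            while (rs.next()) {
                System.out.println(rs.getObject(1));
            }
        }
    }
}
```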
Finally, esProc SPL is open source and free. It is available at https://github.com/SPLWare/esProc.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL