How to implement fast multi-index calculation

In statistical analysis applications, the various indexes calculated from detailed data are important for supporting business. However, implementing fast and flexible multi-index calculation poses several challenges for the backend data source.

One challenge is that the amount of detailed data involved in multi-index calculation is very large. Nowadays many sectors, such as government, finance, energy and industry, are constantly generating large amounts of detailed data, and its scale can reach tens of millions or even a hundred million rows. Calculating multiple indexes over data of this scale with a second-level response is a considerable challenge for both traditional databases and big data technology.

Another challenge is that the number of indexes to be calculated simultaneously is very large. It is common to see a management page containing dozens or even a hundred indexes for business staff to reference, along with many optional parameters for calculating different index values. Moreover, the number of business staff is usually large; when they access such a page simultaneously during peak hours, a flood of computing requests hits the backend.

Under the joint influence of these two challenges, it is very difficult to ensure the performance of multi-index calculation. Since existing database and big data technologies struggle to achieve a second-level response, most index statistics projects have to abandon real-time computing and adopt pre-calculation instead, that is, calculate the indexes in advance and store them in the backend for business staff to query. In this way, index calculation becomes a simple random lookup, and a second-level response can be achieved.

However, this method faces a serious problem. Index calculation involves many kinds of filtering conditions, many grouping fields, and many aggregation methods. Since the number of possible combinations of these conditions, fields and methods is staggering, the number of resulting indexes is staggering too. Take common indexes (order amount, number of transactions) as an example: the filtering conditions include time, date, transaction amount, channel, etc.; the grouping fields include customer, area, product, etc.; and the aggregation methods include sum, average, maximum, minimum, topN, etc. Suppose we want to find the orders with an amount between $10,000 and $20,000 placed in North America yesterday, then group them by customer and product, and finally aggregate the average order amount and the total number of orders. As you can see, there are many prerequisites, and the slightest change to any of them generates a new index. The total number of indexes is therefore enormous, and the number that would need to be pre-calculated is far too large to store all the results in the backend.
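To get a sense of how quickly these combinations multiply, here is a rough Python sketch; the field lists and counts below are purely illustrative assumptions, not figures from any real project:

```python
# Illustrative dimensions (hypothetical, for counting only):
filter_fields = ["time", "date", "amount", "channel"]  # each filter may or may not be applied
group_fields  = ["customer", "area", "product"]        # group by any non-empty subset
aggregations  = ["sum", "avg", "max", "min", "topN"]   # one or more per index

# Filter subsets, including "no filter at all": 2^4 = 16
n_filters = 2 ** len(filter_fields)
# Non-empty grouping subsets: 2^3 - 1 = 7
n_groups = 2 ** len(group_fields) - 1
# Non-empty aggregation subsets: 2^5 - 1 = 31
n_aggs = 2 ** len(aggregations) - 1

total = n_filters * n_groups * n_aggs
print(total)  # 3472 distinct indexes from just a handful of fields
```

And this still ignores the parameter values inside each filter (every amount range or date window is a different index), so the real combinatorial space is far larger still.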

Faced with this problem, most projects have to abandon most indexes and pre-calculate only a small portion, and even then, the results take up a lot of hard disk space. As a result, business staff have to give up flexible queries on the remaining indexes, which greatly reduces the value of the index statistics project to the business.

So, is there a method that achieves a second-level response while still allowing flexible real-time calculation? To answer this, we first need to understand the characteristics of multi-index calculation: the amount of detailed data is very large and usually resides in external storage, and the most important step is to traverse the large table in external storage to perform the filtering, grouping and aggregation calculations. Traversing a large file on hard disk is time-consuming. If the calculation of each index has to traverse the large table once, hard disk IO becomes the bottleneck, and a second-level response is impossible when calculating multiple indexes. If, instead, multiple indexes can be calculated in a single traversal, the number of traversals is greatly reduced and the performance of multi-index calculation is effectively improved.
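The single-traversal idea can be sketched in Python (a conceptual illustration only; the field names and filters are hypothetical, and this is not how SPL is implemented internally): each row, read once from storage, is fed to several independent filter/group/aggregate pipelines, so one pass yields several result sets.

```python
from collections import defaultdict

# Hypothetical order rows; in practice these would stream from a large table on disk.
orders = [
    {"customer": "C1", "area": "NorthAmerica", "product": "P1", "amount": 15000},
    {"customer": "C2", "area": "Europe",       "product": "P2", "amount": 8000},
    {"customer": "C1", "area": "NorthAmerica", "product": "P2", "amount": 12000},
]

# Independent accumulators for three index groups.
by_customer = defaultdict(lambda: {"sum": 0, "count": 0})  # group 1: filtered by amount
by_area     = defaultdict(lambda: {"sum": 0, "count": 0})  # group 2: filtered by area
by_product  = defaultdict(lambda: {"sum": 0, "count": 0})  # group 3: no filter

for row in orders:                       # ONE traversal of the detailed data
    if 10000 <= row["amount"] <= 20000:  # filter for index group 1
        g = by_customer[row["customer"]]
        g["sum"] += row["amount"]; g["count"] += 1
    if row["area"] == "NorthAmerica":    # a different filter for index group 2
        g = by_area[row["area"]]
        g["sum"] += row["amount"]; g["count"] += 1
    g = by_product[row["product"]]       # index group 3 aggregates every row
    g["sum"] += row["amount"]; g["count"] += 1

print(by_customer["C1"])  # {'sum': 27000, 'count': 2}
```

The point is that the expensive part, reading each row from storage, happens exactly once no matter how many index groups are maintained; only the cheap in-memory accumulation is multiplied.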

However, SQL, the language commonly used in relational databases and big data platforms, does not provide such multipurpose-traversal syntax and cannot express such operations; it has to traverse the table multiple times, or pin its hopes on the automatic optimization of the database engine. But actual tests show that even Oracle, whose optimization engine is relatively strong, cannot perform multipurpose traversal automatically and still traverses the data table many times when doing grouping calculations.

esProc SPL, a professional data calculation engine, supports multipurpose-traversal syntax and can calculate multiple grouped results in one traversal, thus enabling real-time index calculation. Taking the order indexes mentioned above as an example, the following code calculates multiple indexes in one traversal:


| | A | B |
|---|---|---|
| 1 | =file("orders.ctx").open() | |
| 2 | =A1.cursor@m(odate,customer,product,area,amount) | |
| 3 | cursor A2 | =A3.select(odate==… && area=="NorthAmerica" &&…) |
| 4 | | =B3.groups(customer;sum(amount),count(1)…) |
| 5 | cursor | =A5.select(odate>=… && customer!=… &&…) |
| 6 | | =B5.groups(area;avg(amount),top(amount,100),…) |
| 7 | cursor | =A7.groups(year(odate),product;avg(amount),min(amount)…) |

A1: open the orders composite table. A2: create a multi-way cursor on the table, specifying the fields to fetch; the data will be read in parallel.

A3: define a channel on the cursor of A2. B3: define a filtering calculation on the channel. B4: group and aggregate the filtered results to obtain the first group of indexes.

A5-B6 likewise define a channel on the cursor of A2, and group and aggregate the filtered results to obtain the second group of indexes; the difference lies in the filtering condition and the way of grouping and aggregation.

For the calculation of the top 100 order amounts in B6, SPL treats topN as an aggregation operation, which gives it a great performance advantage over the big-sorting approach in SQL. For details, visit: How to Make TopN Operations Fast?
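The idea of topN as an aggregation can be sketched in Python with a bounded min-heap (a hypothetical illustration, not SPL's actual code): while streaming, only N candidates are ever kept, so the table is never fully sorted.

```python
import heapq

def top_n(values, n):
    """Keep at most n largest values in a min-heap while streaming once."""
    heap = []
    for v in values:
        if len(heap) < n:
            heapq.heappush(heap, v)
        elif v > heap[0]:               # v beats the smallest of the current top n
            heapq.heapreplace(heap, v)  # evict the smallest, keep v
    return sorted(heap, reverse=True)

amounts = [120, 95, 300, 7, 250, 42, 180]
print(top_n(amounts, 3))  # [300, 250, 180]
```

The cost is O(rows × log N) time and O(N) memory per group, versus O(rows × log rows) time and full-table memory or spill for a big sort, which is why aggregation-style topN scales to very large tables.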

A7-B7 define the third channel on the cursor of A2, and group and aggregate directly without filtering to obtain the third group of indexes.

When SPL executes this code, it only needs to traverse the cursor of A2 once to complete the calculations of all channels and obtain multiple groups of indexes, thus implementing the multipurpose-traversal mechanism.

This example shows multipurpose traversal over a single table; SPL also supports high-performance table association. For details, visit: How to make JOIN run faster?

For indexes that require de-duplication, such as the number of distinct customers, SPL provides ordered storage and fast de-duplication technologies. For details, visit: SQL Performance Enhancement: DISTINCT & COUNT(DISTINCT) on Big Data.
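Conceptually, a de-duplicated count can also be folded into the same single traversal. Here is a simplified Python sketch using a per-group set (hypothetical data; SPL's actual technique relies on ordered storage, so it does not need to hold every key in memory as this sketch does):

```python
from collections import defaultdict

# Hypothetical order stream: (area, customer) pairs read once from storage.
orders = [("NA", "C1"), ("NA", "C2"), ("EU", "C1"), ("NA", "C1"), ("EU", "C3")]

distinct_customers = defaultdict(set)
for area, customer in orders:          # one pass over the detail data
    distinct_customers[area].add(customer)  # sets ignore repeated customers

counts = {area: len(s) for area, s in distinct_customers.items()}
print(counts)  # {'NA': 2, 'EU': 2}
```

With data stored ordered by the de-duplication key, the set becomes unnecessary: counting distinct values only requires comparing each value with the previous one during the traversal, which is the advantage ordered storage brings.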

These high-performance algorithms can be implemented in SPL with very simple code. Used in conjunction with multipurpose traversal, they fully meet the performance requirements of real-time index calculation.