SQL Performance Enhancement: Count N Earliest Events in each Group
[Abstract]
This essay analyzes the fundamental reason why such SQL statements run slowly and presents a faster solution, illustrated with code samples.
Problem description
Counting the N earliest events in each group is an in-group time-series operation. The data to be counted usually contain a grouping field, an event time field, and other information. In user behavior analysis, for example, the grouping field is the user ID and each event has a type attribute (page view, add to cart, and so on). Based on this information, user behavior analysis takes the types of the earliest N events per user and finds which event-type sequences occur most often among users. The gross data volume of an in-group time-series calculation is usually so large that it must be stored externally, while the data of each single group are small enough to be held in memory.
Suppose table T contains the grouping field gid, the event type field event, and the time field etime, with each gid corresponding to several events. The calculation requirement: filter the records of table T by a certain condition (which varies with the situation and may involve parameter variables), group them by gid, and keep only the groups with no less than N events. Then, for each qualifying group, take the earliest N events in time order and count how many groups share each resulting sequence of event types. N is generally small, and so is the result set.
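For instance, take N=2 and a few hypothetical rows of T, written as (gid, etime, event): (1, 10:00, view), (1, 10:05, cart), (2, 09:00, view), (2, 09:10, view), (3, 11:00, cart). Group gid=3 has fewer than 2 events and is dropped; the earliest-2 sequence of gid=1 is (view, cart) and that of gid=2 is (view, view), so the result is (view, cart)→1 and (view, view)→1.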
With N=5, the SQL is written as follows:
with t1 as (
    select gid,
        MAX(CASE WHEN event_number = 1 THEN event ELSE NULL END) AS e1,
        MAX(CASE WHEN event_number = 2 THEN event ELSE NULL END) AS e2,
        MAX(CASE WHEN event_number = 3 THEN event ELSE NULL END) AS e3,
        MAX(CASE WHEN event_number = 4 THEN event ELSE NULL END) AS e4,
        MAX(CASE WHEN event_number = 5 THEN event ELSE NULL END) AS e5
    from (select gid, event,
            ROW_NUMBER() OVER (PARTITION BY gid ORDER BY etime ASC) AS event_number
          from T
          where …) t0
    where event_number <= 5
    group by gid
)
select t1.e1, t1.e2, t1.e3, t1.e4, t1.e5, count(*) as c
from t1
where t1.e2 IS NOT NULL AND t1.e3 IS NOT NULL AND t1.e4 IS NOT NULL AND t1.e5 IS NOT NULL
group by 1,2,3,4,5
The database has to execute the innermost subquery first: it filters table T by the given condition, and the window function partitions the rows by gid and numbers them within each partition in time order. In the middle-layer subquery t1, the database groups and aggregates the data by gid, converting the event types numbered 1 to 5 into five fields. In the outermost query, it keeps the records whose last four fields are all non-null (i.e., the groups with at least five events) and then groups by the five fields to count. The result set of grouping the filtered table T by gid is generally large, and computing a grouping with a large result set requires buffering to external storage, which leads to poor performance.
Solution
1. Presorting and order-based algorithm
We first presort the original data by the grouping field. The algorithm traverses the ordered, condition-satisfying data and reads each group into memory in turn. If the current group has fewer than N records, it is skipped and the next group is read directly; otherwise the group is sorted by time, and the event types of its first N records are grouped and aggregated. This method traverses the data only once and avoids a grouping with a large result set, which improves performance greatly (see Code sample 1 below).
We can also sort the original data by both the grouping field and the time field beforehand, sparing the re-sort of each group. This adds little extra performance but simplifies the code noticeably (see Code sample 2).
In practice, the grouping field of an in-group time-series calculation is usually fixed (the user ID, the account number) rather than chosen at random, so sorting the data by such a field benefits many in-group time-series calculations. Other in-group time-series calculations differ only in their specific algorithms, which will be introduced in their own articles.
Presorting is slow, but it is a one-off cost, and it keeps only one copy of the data with no redundancy.
SQL, being based on unordered sets, cannot guarantee that the data of each group are stored contiguously in order, so it cannot use the order-based algorithm directly. In addition, the order in which SQL code is written differs from the order in which it executes, which makes the code hard both to write and to read. If the writing order matched the execution order, the code would be notably simpler, and performance weaknesses would be easier to find and fix.
2. Newly-added data
Newly-added data are not necessarily ordered by the grouping field, so they cannot simply be appended to the end of the already ordered data; and re-running an ordinary full sort over the existing data plus the new data is very time-consuming.
Instead, we can sort only the newly-added data and then merge them with the already ordered data using a low-cost order-based MERGE algorithm to generate a new ordered data table. The overall cost is equivalent to reading and writing the full data once, which avoids the repeated temporary buffering to external storage of an ordinary full sort and thus improves performance greatly.
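As an illustration only, here is a minimal SPL sketch of such an order-based merge. It assumes the mergex function and the hypothetical file name T_merged.ctx; the field lists abbreviated as … must be filled in as in the code samples below:

  A
1 =file("T.ctx").open().cursor()
2 =file("T_new.btx").cursor@b().sortx(gid,etime)
3 =[A1,A2].mergex(gid,etime)
4 =file("T_merged.ctx").create(#gid,#etime,…)
5 =A4.append@i(A3)

A2 sorts only the small daily batch; A3 merges the two ordered cursors by gid and etime; A4 and A5 write the merged stream into a new ordered composite table.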
Furthermore, we can additionally keep a small ordered table (hereinafter called the patch data). New data are merged into the patch data only, leaving the old ordered data unchanged; only after the patch data have accumulated to a suitable size, after a while, are they merged into the old ordered data. When performing the in-group time-series calculation, the algorithm reads from the old ordered data and the patch data respectively, merges them, and then calculates. Performance is slightly lower than with a single ordered table, but the orderliness of the data still makes the computation much faster.
When to merge the patch data into the old ordered data depends on how often new data arrive. If new data arrive every day, we can do the merge once a month: the patch data then hold at most one month of data, and the old ordered data hold everything older than a month. That is, the patch data can be far smaller than the old ordered data, so the daily merge into the patch is small and the appending is fast. Since the whole-data order-based merge happens only once a month, it is acceptable that it takes a little longer (Code sample 4 implements this scheme).
Code sample
1. Count the earliest 5 event types in each group when the data are ordered by the grouping field but unordered by the time field.
  A
1 =file("T.ctx").open().cursor(gid,etime,event;…)
2 =A1.group(gid).select(~.len()>=5).(~.sort(etime))
3 =A2.groups(~(1).event:e1,~(2).event:e2,~(3).event:e3,~(4).event:e4,~(5).event:e5;count(1):c)
A1: open the composite table T.ctx and create a cursor based on it. The where condition is written after the semicolon.
A2: define order-based grouping by gid on the cursor, then define the filtering and sorting: groups shorter than 5 are skipped, and the data in each remaining group are sorted by etime. Only calculations on the cursor are defined here; no actual traversal takes place.
A3: traverse the cursor, executing the calculations defined in A2 on each gid group, then group and aggregate by the event types of the first 5 records of each group to count. Since the result set is small, groups, which returns a small in-memory result set, is used.
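If the counting result needs to be saved, a line like the following could be appended; the file name is hypothetical, and export@tc (write a comma-separated file with column titles) is assumed:

4 =file("top_events.csv").export@tc(A3)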
2. Count the earliest 5 event types in each group when the data are ordered by both the grouping field and the time field.
  A
1 =file("T.ctx").open().cursor(gid,etime,event;…)
2 =A1.group(gid).select(~.len()>=5)
3 =A2.groups(~(1).event:e1,~(2).event:e2,~(3).event:e3,~(4).event:e4,~(5).event:e5;count(1):c)
Compared with the foregoing code, A2 is simpler: (~.sort(etime)) can be omitted because the data are already ordered by both the grouping field and the time field.
3. Presort the data and store the ordered result.
  A
1 =file("T-original.ctx")
2 =A1.open().cursor().sortx(gid,etime).new(gid,etime,…)
3 =file("T.ctx").create(#gid,#etime,…)
4 =A3.append@i(A2)
A1: define the original composite table T-original.ctx.
A2: open the composite table T-original.ctx, create a cursor on it, sort the cursor by gid and etime, and use new to put gid and etime in the leading positions. The sortx can be written as sortx(gid) if the data only need to be ordered by gid.
A3: create a new composite table T.ctx; the pound sign # before a field name means the table is ordered by that field, here gid and etime. The create can be written as create(#gid,etime,…) if the table is only ordered by gid.
A4: write the ordered data into the composite table T.ctx.
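Putting the two gid-only variants mentioned above together, the presort script would look like this (a sketch; the field lists abbreviated as … are unchanged):

  A
1 =file("T-original.ctx")
2 =A1.open().cursor().sortx(gid).new(gid,etime,…)
3 =file("T.ctx").create(#gid,etime,…)
4 =A3.append@i(A2)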
4. Append the newly-added data
Suppose the daily newly-added data for table T are stored in T_new.btx, which has the same field names in the same order as T.ctx.
  A                                               B
1 =file("T.ctx").open()
2 =file("T_new.btx").cursor@b().sortx(gid,etime)
3 if (day(now())==1)                              >A1.reset()
4 >A1.append@a(A2)
A1: open the composite table T.ctx.
A2: define a cursor on T_new.btx and sort it. As the daily volume of new data is small, the sorting is a fast, in-memory operation even though the sortx function is used. The etime in sortx should be removed if the data only need to be ordered by gid.
A3: check whether the current date is the first day of the month; if so, execute B3, where the reset function merges the accumulated patch data into the existing ordered data to generate a new ordered table. A4: append the day's sorted new data; append@a merges them into the patch data only, leaving the old ordered data untouched.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL