SQL Performance Enhancement: Conversion Funnel Analysis
[Abstract]
The essay analyzes the fundamental reason behind slow SQL statements and provides the solution to make improvements through code samples. Click SQL Performance Enhancement: Conversion Funnel Analysis for details.
Problem description
Conversion funnel analysis is an in-group time-series operation, which benefits many situations and the conversion analysis of e-commerce purchase is one of them. The user performs actions like page view, search, add to cart, order placing, payment and so on after he logs in the e-commerce website. These events are ordered in time and part of the users will be lost after each of them. Conversion funnel analysis usually counts the user number after each event firstly, then does complex calculations like conversion rate on the basis of the former.
Suppose the fields of table T include id (the event number), gid (the grouping field), etime(the time field), eventtype(the event type), and each of them respectively corresponds to the user event number, user number, action time, event type of the e-commerce purchase analysis. The gross data volume table T involving is usually quite large, which needs to be stored externally, while the data of each group (the user) are not that big, which can be hold in memory.
Here 3 steps of the solution to conversion funnel are illustrated, and other steps can be deduced by analogy. Take Oracle for example, and the SQL is as follows:
with e1 as (
select gid,1 as step1,min(etime) as t1
from T
where etime>= to_date('2021-01-10', 'yyyy-MM-dd') and etime<to_date('2021-01-25', 'yyyy-MM-dd') and eventtype='eventtype1' and …
group by 1
),
e2 as (
select gid,1 as step2,min(e1.t1) as t1,min(e2.etime) as t2
from T as e2
inner join e1 on e2.gid = e1.gid
where e2.etime>= to_date('2021-01-10', 'yyyy-MM-dd') and e2.etime<to_date('2021-01-25', 'yyyy-MM-dd') and e2.etime > t1 and e2.etime < t1 + 7 and eventtype='eventtype2' and …
group by 1
),
e3 as (
select gid,1 as step3,min(e2.t1) as t1,min(e3.etime) as t3
from T as e3
inner join e2 on e3.gid = e2.gid
where e3.etime>= to_date('2021-01-10', 'yyyy-MM-dd') and e3.etime<to_date('2021-01-25', 'yyyy-MM-dd') and e3.etime > t2 and e3.etime < t1 + 7 and eventtype='eventtype3' and …
group by 1
)
select
sum(step1) as step1,
sum(step2) as step2,
sum(step3) as step3
from
e1
left join e2 on e1.gid = e2.gid
left join e3 on e2.gid = e3.gid
Sub-query e1 firstly filters the data whose event type is eventtype1 in a specified period of time form table T. The starting and ending time of the specified period usually are the given parameters, which change dynamically. “and...” means to have the possibility to satisfy other conditions, and sub-queries e2 and e3 also need to satisfy the same condition. Then the data are grouped by gid, and the smallest etime of each group are retrieved as the time t1 of the first step of conversion funnel, and the value of new field step1 are defined as 1.
Sub-query e2 uses table T (also known as e2) to inner join with e1, and the join field is gid. It filters the data whose e2.etime is less than 7 days after t1 and event type is eventtype2 in the specified period. The 7-day here is called the funnel window period, which can also be the given parameter. Then the data are group by gid, the smallest e2.etime of each group are retrieved as the time t2 of the second step of conversion funnel, and the value of new field step2 are defined as 1.
Sub-query e3 uses T table (also known as e3) to inner join with e2, and the join field is gid. It filters the data whose e3.etime is larger than t2 as well as less than 7 days after t1 (funnel window period), and event type is eventtype3 in the specified period. Then the data are grouped by gid, the smallest e3.etime of each group are retrieved as the time t3 of the third step of conversion funnel, and the value of new field step3 are defined as 1.
At last, the database uses e1 to left join with e2 and e3, the join field is gid, and aggregates step1, step2, and step3 to get a sum.
Here are the 3 steps of funnel conversion analysis. If you want to accomplish more steps of funnel conversion analysis, please refer to e3 to write sub-queries like e4, e5, and so on, and main query needs to be added with thecorresponding main code of join and aggregation.
Generally speaking, the result set of grouping table T of a specific period by gid tends to be big. To perform big group requires buffering on external storage, which results in bad computation performance.
Solution
1. Presorting and order-based algorithm
We have to presort the original data orderly according to the grouping field in the process of data preparation. When performing conversion funnel analysis, the algorithm traverses the ordered data which satisfy the conditions like the time period, the event type and so on, reads the data of each group into memory in turn. Specifically, step 1: it sorts the current grouped data in time order; step 2: it finds the first record of the first event type in the current group, records the time as t1, and assigns t1 to the first member of the current grouping result set. If t1 cannot be found, then the current group is skipped; step 3: in the records of the second event type in the current group, it finds the first record whose time is after t1 and before t1+7 (the funnel window period), and assigns its time t2 to the second member of the result set (if t2 cannot be found, then the assignment will be null); step 4: in the records of the third event type in the current group, it finds the first record whose time is after t2 and before t2+7, and assigns its time t3 to the third member of the result set (if t2 is null or t3 cannot be found, then the assignment will be null); step 5: it aggregates the three members of the result set per group to get the final result. This method traverses the data only once and avoids performing grouping with a big result, thus improving the performance greatly.
We can also sort the original data by both grouping field and time field beforehand and spare the resorting of each group, which improves the performance less effectively, but simplifies the code obviously.
In fact, many grouping fields of in-group time-series calculation are certain (like the user ID and the account number) rather than randomly chosen. Sorting data by these certain fields can help many situations of in-group time-series calculation. Other in-group time-series calculations only differ in the specific algorithms, which will be introduced on their own.
Presorting is slow yet one-off and holds only one kind of storage without redundant data.
SQL, based on the unordered set, cannot ensure the data of each group are orderly stored, therefore, it cannot directly use the order-based algorithm.
2. New-added data
Newly-added data are not always ordered by the grouping field, so they should not be simply appended to the end of the already ordered data. And it will be very time-consuming to directly do an ordinary full sorting of the existing data and the new data together.
Instead, we can sort the newly-added data first and then, along with the already ordered data, perform low-cost order-based MERGE algorithm to generate a new ordered data table. In this way, the overall degree of complexity equals to reading and writing the whole data once, which avoids the frequent temporary external buffering in ordinary full sorting, thus improving the performance greatly.
Furthermore, we can keep a small-scale ordered table additionally (hereinafter referred to as patch data), which will merge with the newly-added data while keeping the old ordered data unchanged. The patch data will merge with the old ordered data until they accumulate to a suitable size after a while. When performing the time-series calculation in the group, the algorithm reads from the old ordered data and patch data respectively, merges them and then calculates. In this manner, the performance will be a bit slower compared with handling one table of ordered data, but the data orderliness can still make the computation much faster.
The time when to merge the patch data with the old ordered data is related to the cycles when data should be added. If new data are added every day, we can do the merge once a month. The patch data store data of one month at most and the existing ordered data contain all data one month ago. That is, the patch data may be far smaller than the old ordered data, so the data to be merged every day is relatively small and the data appending is fast. As we do the whole-data order-based merge only once a month, it is fairly acceptable that it takes a littler longer to complete.
Code example
1. Calculate the conversion funnel when the data is ordered by grouping field and unordered by time field.
A |
B |
|
1 |
=["eventtype1","eventtype2","eventtype3"] |
=file("T.ctx").open() |
2 |
=B1.cursor(gid,etime,eventtype;etime>=date("2021-01-10") && etime<date("2021-01-25") && A1.contain(eventtype) && …) |
|
3 |
=A2.group(gid).(~.sort(etime)) |
=A3.new(~.select@1(eventtype==A1(1)):first,~:all).select(first) |
4 |
=B3.(A1.(t=if(#==1,t1=first.etime,if(t,all.select@1(eventtype==A1.~ && etime>t && etime<t1+7).etime, null)))) |
|
5 |
=A4.groups(;count(~(1)):STEP1,count(~(2)):STEP2,count(~(3)):STEP3) |
A1: define the three event types, which can also be given parameters.
B1: open the composite table T.ctx.
A2: create a cursor to filter the data satisfying the conditions like time period, event type and so on. These conditions can be given parameters.
A3: define the grouping operation, and sort each group by etime.
B3: create a new cursor, name the first data of the first event type in each group as “first” and the original group as “all”. Then define filtering to skip the data whose “first” field is null.
A4: define the calculation of B3. Loop through each record according to A1. The first loop assigns the time of “first”, which is the earliest time of the first event type, to variable t1, t and the first member of the result set at the same time. The second loop finds the first record whose eventtype is the second event type in “all” and etime is larger than t and smaller than t1+7. Its etime is assigned to t and the second member of the result set (if t cannot be found, the assignment will be null). The third loop finds the first record whose eventtype is the third event type in “all” and etime is larger than t and smaller than t1+7 if t is not null. Its etime is assigned to t and the third member of the result set (if t is null or cannot be found , the assignment will be null). Please note that the 7 days (funnel window period) here can also be given parameter.
A5: execute the previously defined calculation and aggregate the three members of each result set in each group to return a small result set and count the number.
Here are still 3 steps of conversion funnel analysis. To accomplish more steps, you can add the event types of following steps in the event type sequence of A1, for example, ["eventtype1","eventtype2","eventtype3","eventtype4","eventtype5"…]. The code can stay the same if the event type sequence is the given parameter. This method is much simpler compared with adding sub-queries and modifying the main query in SQL.
2. Calculate the conversion funnel when the data is ordered by both grouping field and time field.
The above code can be even simpler and (~.sort(etime)) can be omitted in A3 if the data are ordered by both grouping field and time field in advance.
3. The data are presorted and the ordered result set are stored.
A |
|
1 |
=file("T-original.ctx") |
2 |
=A1.open().cursor().sortx(gid,etime).new(gid,etime,…) |
3 |
=file("T.ctx").create(#gid,#etime,…) |
4 |
=A3.append@i(A2) |
A1: define the original composite table T-original.ctx.
A2: open the composite table T-original.ctx, create a cursor based on it, sort the cursor by field gid and field etime where field gid and field etime are in the leading positions. The sortx can be written as sortx(gid) if the cursor is only sorted by gid.
A3: create a new composite table T.ctx with pound sign # preceding the field name, which means this composite table is ordered by field gid and field etime. The create can be written as create(#gid,etime,…) if the table is only sorted by gid.
A4: write the ordered data into the composite table T.ctx.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version