Data Analysis Programming from SQL to SPL: User Event Statistics - Continued
This article is a continuation of Data Analysis Programming from SQL to SPL: User Event Statistics
Continue to use the user behavior table actions, which records the occurrence time of 9 types of events for each user:
3. Search for users who ranked in the top 500 for activity for four consecutive days
The more events a user has per day, the more active they are considered to be. Therefore, the activity ranking is the ranking of the number of events.
First, group and aggregate the number of events by date and user. Then, find the top 500 users with the most events each day. Next, calculate the intersection for every 4 consecutive days, that is, the users whose activity is in the top 500 for 4 consecutive days. Finally, calculate the union of these users and de-duplicate.
Let’s look at the SQL code first:
with t1 as (
select date(edate) edate, userid, count(*) cnt
from actions
group by date(edate), userid),
t2 as (
select edate, userid, rank() over(partition by edate order by cnt desc) rn from t1),
t3 as (
select * from t2 where rn<=500),
t4 as (
select a2.edate,a2.userid from t3 a1
join t3 a2 on a1.userid=a2.userid and a1.edate=a2.edate-interval 1 day
join t3 a3 on a1.userid=a3.userid and a1.edate=a3.edate-interval 2 day
join t3 a4 on a1.userid=a4.userid and a1.edate=a4.edate-interval 3 day
)
select distinct userid from t4 order by userid;
SQL, when performing grouping, cannot keep the grouped subsets and cannot directly rank users of each day. To calculate the daily ranking, SQL has to use the window function to perform a second grouping (partition by) by edate on the subquery (t1) that has already been grouped and aggregated, resulting in repetitive grouping syntax.
Moreover, SQL window function can only operate on the result set of a query. This forces the filtering of the top 500 and the ranking to be done separately, not in a single SQL statement, thus requiring a split into two subqueries (t2, t3).
Due to the unordered nature of SQL sets, it is not possible to simply reference 4 consecutive days by position. To address this, a self-JOIN operation is required to join 4 consecutive days for calculation, which is not a very intuitive approach.
In contrast, SPL can keep grouped subsets and can also reference set members by position, making it easy to write code using natural thinking:
A |
|
1 |
=file("actions.txt").import@t() |
2 |
=A1.group(date(etime)).( ~.groups(userid;count(1):cnt).top@r(-500;cnt) .(userid) ) |
3 |
=A2.(if(#>3,~[-3:0].isect())) .conj() .id() |
A2: Group by date to obtain an event subset for each day. For each day’s event subset ~, group and aggregate the number of events cnt for each user. Use the top() function to obtain the top 500 users with the largest number of events.
A3: Perform calculations on each day’s data. The ~ here represents the current day, so ~[-3:0] is the 4 consecutive days from the previous three days to the current day. Use the isect() function to calculate the intersection of the users in the four days (the top 500 users), then use conj() to find the union, and finally use id() to remove duplicates.
Most procedural programming languages have weak set operation capabilities, whereas SQL, despite having some, offers weak support for procedural logic. SPL addresses the shortcomings of both; its procedural steps are very clear, with a single function typically implementing a single calculation step, and it also provides powerful set operations, especially for ordered sets.
4. Calculate the number of users who have been active for 3 consecutive days or more in the last 7 days (including the current day) every day
A user is considered active if they have at least one event within a day.
First, calculate the target date sequence using the earliest and latest dates. Then, match each user’s events with the dates in the sequence, with any unmatched dates designated as null. Next, determine whether each user has been active for 3 consecutive days within 7 days preceding each day. Finally, count the number of users meeting these criteria for each day.
with recursive edates as (
select min(date(etime)) edate from actions
union all
select date_add(edate, interval 1 day) from edates
where edate<(select max(date(etime)) from actions)
),
users as (
select distinct userid from actions
),
crox as (
select u.userid, d.edate, t.edate rdate
from edates d cross join users u
left join (select distinct userid, date(etime) edate from actions) t
on d.edate=t.edate and u.userid=t.userid
),
crox1 as (
select userid,edate, rdate, row_number() over(partition by userid order by edate) rn,
case when rdate is null or
(lag(rdate) over(partition by userid order by edate) is null and rdate is not null)
then 1 else 0 end f
from crox
),
crox2 as (
select userid, edate, rdate,
cast(rn as decimal) rn, sum(f) over(partition by userid order by edate) g
from crox1
),
crox3 as (
select t1.userid, t1.edate, t2.g, case when count(*)>=3 then 1 else 0 end active3
from crox2 t1 join crox2 t2
on t1.userid=t2.userid and t2.rn between t1.rn-6 and t1.rn
group by t1.userid,t1.edate,t2.g
),
crox4 as (
select userid, edate, max(active3) active3
from crox3
group by userid,edate
)
select edate, sum(active3) active3
from crox4
group by edate;
This SQL statement is very complex, incorporating many subqueries and employing recursive syntax, which makes it difficult to understand and more difficult to write. Here, the explanation will not be provided. You can try it yourself if you're interested.
SPL still allows coding according to natural thinking:
A |
|
1 |
=file("actions.txt").import@t() |
2 |
=periods(date(A1.min(etime)), date(A1.max(etime))) |
3 |
=A1.group(userid).(~.align(A2,date(etime))) |
4 |
=A3.(~.(~[-6:0].pselect(~&&~[-1]&&~[-2]))) |
5 |
=A2.new(~:date, A4.count(~(A2.#)):count) |
A2: Generate the target date sequence using the earliest and latest dates;
A3: Group by user, and then use the align function to align each user’s event with each day of the date sequence in A2 to form a daily activity status sequence for each user:
Clicking on a certain member, you can see a sequence like this:
If a user has a corresponding event on a given day, they are considered active. For example, user 1 is active on 6 out of the last 10 days, with days having no corresponding events marked as null.
A4: Continue to process each user’s activity status sequence. For each day, usepselect()to determine if there exists a day where that day and its previous 2 days (i.e., 3 consecutive days) are all active (status is not null) within the 7-day activity status sequence (~[-6:0]), which spans the current day and the 6 prior days. If found, return that day’s position within the 7-day sequence; otherwise, return null.This can determine if there are 3 consecutive active days within the 7-day period. The return value of pselect() forms a status sequence describing whether the user has 3 active days within every 7-day period.
A5: As long as the number of times that the member at the corresponding positions in the status sequence of all users in A4 (i.e., a user’s status on a given day) is not null is counted for each day, you can obtain the number of users who were active for at least 3 days within the 7-day period encompassing that day and the 6 preceding days:
As computing requirements become more complex, the difficulty of writing SQL code does not change linearly. Factors such as the inability to define operations on grouped subsets, unordered sets, and the need to find roundabout alternatives for complex cross-row operations will exacerbate the difficulty. SPL, in contrast, allows for new computation steps to be added gradually following a natural thought process, simplifies debugging, and can considerably increase coding efficiency.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese Version