Data Analysis Programming from SQL to SPL: User Event Statistics

 

The following is part of the data from the simplified user behavior table actions, which records the occurrence time of 9 types of events for each user:

..

1. Calculate the number of sessions for each user

A session is considered over if a user does not take action for more than 2 days, or does not take action within 2 hours after an e8 event.

The basic idea to solve the problem is to first group by user, then count the number of sessions when the session end condition is satisfied in each user’s event set, and finally add 1 to get the result (if there are n session ends, then the number of sessions is n+1).

SQL code:

with t1 as (
    select userid,etime,event,
       lag(event) over(partition by userid order by etime) event1,
       lag(etime) over (partition by userid order by etime) etime1
    from action)
select userid, sum(
    case when timestampdiff(second,etime1,etime)>172800
    or (event1='e8' and timestampdiff(second,etime1,etime)>7200) then 1
    else 0 end)+1 cnt
from t1 group by userid;

SQL sets are unordered; the members within them have no defined position, so they cannot be referenced by the position, and the window function lag is needed to calculate the previous event. If the condition involves the previous events event and etime, the extensive lag clause has to be written twice, which is quite cumbersome.

Due to the complexity of SQL, the values returned by window function cannot be directly used in the main query. Instead, they need to be pre-calculated in a subquery as event1 and etime1, which adds more trouble to solving the problem.

SQL cannot keep the grouped subsets; each step of calculation requires re-grouping all data. Here are a total of three grouping operations – two partition by and one group by - and the code is also quite complex to write.

In contrast, SPL sets are ordered, allowing for easy referencing of data by relative position. Additionally, SPL can keep grouped subsets, allowing for coding that follows a natural thinking:


A

1

=file("actions.txt").import@t().sort(etime)

2

=A1.group(userid ; ~.count(
interval@s(etime[-1],etime) >172800 ||
(event[-1]=="e8" && interval@s(etime[-1],etime)>7200)
)+ 1 : cnt)

A1: Read data and sort them by etime.

A2: Group by userid. The ~ in the group function represents the grouped subset, that is, the event set of the current user; its members are still ordered by etime. The data grouped by SPL will maintain the original order. SPL provides [] to implement data reference by relative position, where etime[-1] and event[-1] represent the previous row’s etime and event, making it very simple to determine the session end condition.

2. Calculate the day 2 retention rate of new users

Day 2 retention rate of new users = number of users who continued visiting on day 2 / number of new users on day 1.

The basic idea to solve the problem is to first find the active dates for each user (i.e., the dates when events occurred), then identify the earliest of these dates (which is the day the user first appeared as a new user), next, check if the next day is still in the users active dates (if it is, it indicates that the user is retained), and finally, counting produces the daily retention rate.

SQL code:

with t1 as (
    select userid, date(etime) edate from actions group by userid,date(etime)),
t2 as (
    select userid, edate, row_number() over(partition by userid order by edate) rn
    from t1
),
firstday as (
    select userid, min(edate) frst from t2 group by userid),
retention as (
    select fd.userid, frst, t.edate nxt
    from firstday fd left join t2 t
    on fd.userid=t.userid and date_add(fd.frst, interval 1 day)=t.edate
    group by fd.userid, frst, nxt
)
select frst edate, count(nxt)/count(frst) rate
from retention
group by edate
orde

The subquery t1 uses group by to find the active dates of each user, but the result set is unordered at this point. It also needs to use a window function in t2 to assign a sequence number rn to every event within each users range to obtain the date of the first event.

Then, it needs to calculate the date of the next day based on this date and check whether the calculated date is in the users active date set. Since SQL cannot keep the event subset for individual user and cannot perform calculations directly on these subsets, it has to change the approach to implement, that is, add 1 to the date of the first day and perform a left join with the entire date set to determine whether the user has been retained. This, however, makes the code more difficult to understand.

In contrast, SPL sets are ordered, and SPL supports operations on grouped subsets, allowing us to write code in full accordance with natural thinking:


A

1

=file("actions.txt").import@t()

2

=A1.groups(userid,date(etime):edate)

3

=A2.group(userid)

4

=A3.new(userid, edate:frst, ~.select@1(edate==frst+1).edate:nxt)

5

=A4.groups(frst ; count(nxt)/count(frst):rate)

The groups() function in A2 calculates the date while grouping by userid and edate. The grouped results aggregate the dates and are ordered, eliminating the need for manual sequence numbering:

..

SPL’s IDE is highly interactive, enabling step-by-step execution and easy visual inspection of the results of each step in the right-hand panel at any time. After grouping by userid in A3, multiple subsets are obtained. Select A3, its result data will appear on the right-hand side:

..

The group() function is simply for grouping, without enforcing binding of aggregate operations. Clicking on the member of the grouped result shows that it is still a set, that is, the date set of each user:

..

The new() function in A4 performs calculations on each user’s event set. In this function, the default value of edate is the first date, i.e., the first day; the ~ symbol represents the current grouped subset, in which it checks for the existence of the next day nxt. If nxt is not found, it returns null, indicating that the user has not been retained:

..

Finally, it is just a matter of simply counting and calculating the ratio.

Once you are familiar with SPL, these steps can be simply written in one statement:


A

1

=file("actions.txt").import@t()

2

=A1.groups(userid,date(etime):edate).group(userid)

.new(userid, edate:frst, ~.select@1(edate==frst+1).edate:nxt)

.groups(frst ; count(nxt)/count(frst):rate)

Next article for more demo:《Data Analysis Programming from SQL to SPL: User Event Statistics-2》

【Download Test Data】