User Behavior Analysis in Practice 1: Conventional Grouping and Aggregation
Target task
We have a user events table T. Below is its structure and part of its data:
| Time | UserID | EventType |
| --- | --- | --- |
| 2022/6/1 10:20 | 1072755 | Search |
| 2022/6/1 12:12 | 1078030 | Browse |
| 2022/6/1 12:36 | 1005093 | Submit |
| 2022/6/1 13:21 | 1048655 | Login |
| 2022/6/1 14:46 | 1037824 | Logout |
| 2022/6/1 15:19 | 1049626 | AddtoCart |
| 2022/6/1 16:00 | 1009296 | Submit |
| 2022/6/1 16:39 | 1070713 | Browse |
| 2022/6/1 17:40 | 1090884 | Search |
Fields in table T:
| Field name | Data type | Description |
| --- | --- | --- |
| Time | Datetime | Time stamp of an event, accurate to milliseconds |
| UserID | Integer | User ID |
| EventType | String | Types of events |
Computing task:
For a specified time period, count the number of events of each type and the number of distinct users who performed each type of event.
Techniques involved:
1. Use binary file storage instead of database storage.
2. Use parallel processing during traversal.
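For context, if the full history were small enough to load into memory, the whole task could be sketched in a few SPL cells. This is only a minimal sketch for comparison, assuming the same demo data source used below; the @x option on query is assumed to close the connection once the query returns. The two techniques above are what make the task practical when the data volume is large.

|  | A |
| --- | --- |
| 1 | >start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd") |
| 2 | =connect("demo").query@x("select * from T") |
| 3 | =A2.select(Time>=start && Time<=end).groups(EventType; count(1):Num, icount(UserID):iNum) |

A1 Define the time period. A2 Load the whole table into memory. A3 Filter by time and group by EventType, counting events and distinct users.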
Sample code
1. Dump data from database and store it in a binary file
Stocked (historical) data: retrieve the data from the database and write it to a bin file:
|  | A |
| --- | --- |
| 1 | =connect("demo").cursor@x("select * from T") |
| 2 | =file("T.btx").export@b(A1) |
A1 Connect to the database, retrieve data from table T and generate a cursor. The @x option automatically closes the database connection after data retrieval finishes.
A2 Export A1's data to the bin file T.btx. The @b option writes the data in the binary format.
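A quick sanity check after the dump is to read a few records back from the bin file. This is just a verification sketch; fetch(100) grabs the first 100 rows from the cursor for inspection.

|  | A |
| --- | --- |
| 1 | =file("T.btx").cursor@b().fetch(100) |

A1 Open a cursor over the bin file with @b and fetch the first 100 records to confirm the export.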
Newly added data: when new data needs to be appended to an existing bin file, it can be obtained with a filtering condition in SQL. The @a option appends data to a bin file.
The new data can be identified by its timestamp. Each day, just after midnight, we append the data generated during the previous day to the bin file:
|  | A |
| --- | --- |
| 1 | =connect("demo").cursor@x("select * from T where Time>=? && Time<?",date(now()-1), date(now())) |
| 2 | =file("T.btx").export@ba(A1) |
A1 Use a filtering condition to get the data generated in the previous day and return it as a cursor.
A2 Fetch A1's data from the cursor and append it to the bin file T.btx. The @a option enables appending; without it the existing bin file would be overwritten.
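If one day's append is missed, the same pattern can backfill an explicit date. The sketch below uses a hypothetical variable someday; someday+1 relies on the same day-level date arithmetic as now()-1 in the script above.

|  | A |
| --- | --- |
| 1 | >someday=date("2022-06-02","yyyy-MM-dd") |
| 2 | =connect("demo").cursor@x("select * from T where Time>=? && Time<?",someday,someday+1) |
| 3 | =file("T.btx").export@ba(A2) |

A1 Define the day to backfill (a hypothetical example date). A2 Retrieve that day's records as a cursor. A3 Append them to the bin file with @a.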
2. Perform grouping and aggregation on a bin file
Suppose we need to summarize the data that falls between 2022-03-15 and 2022-06-16:
|  | A |
| --- | --- |
| 1 | =file("T.btx").cursor@mb() |
| 2 | >start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd") |
| 3 | =A1.select(Time<=end && Time>=start).groups(EventType; count(1):Num, icount(UserID):iNum) |
A1 Generate a cursor for the bin file T.btx. The @m option enables parallel processing through a multicursor. The default number of parallel threads is set in esProc's configuration file raqsoftconfig.xml, or you can write f.cursor@m(n), where n is the number of parallel threads. It is recommended that n be less than the number of the computer's CPU cores; otherwise performance degrades.
A2 Define two variables, start and end, for filtering data by timestamp. In real-world computations the values would be passed in as parameters.
A3 Perform filtering, grouping and aggregation on A1's cursor. The grouping field is EventType; count(1) counts the events and icount(UserID) counts the distinct UserIDs. Note that no matter how many operations are attached to a cursor, such as multiple rounds of filtering, sorting or grouping, they are executed together in a single pass; the data is traversed only once to produce the final result.
Execution result:
| EventType | Num | iNum |
| --- | --- | --- |
| AddtoCart | 1845674 | 175476 |
| Browse | 3578901 | 348791 |
| Login | 4033000 | 393400 |
| Logout | 4033000 | 393400 |
| Search | 2947931 | 257539 |
| Submit | 867345 | 83375 |
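The one-pass traversal extends naturally to additional filter conditions. The sketch below, which restricts the same computation to a single event type ("Submit" is chosen only for illustration), changes nothing but the select expression:

|  | A |
| --- | --- |
| 1 | =file("T.btx").cursor@mb() |
| 2 | >start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd") |
| 3 | =A1.select(Time<=end && Time>=start && EventType=="Submit").groups(EventType; count(1):Num, icount(UserID):iNum) |

A3 Adds the EventType condition to the same filter expression; the data is still traversed only once.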