User Behavior Analysis in Practice 14: Real-time T+0 Analysis
Target task
We have a user events table T. Below is its structure and part of its data:
Time |
UserID |
EventType |
OS |
Browser |
ProductID |
… |
f1 |
f2 |
f3 |
f4 |
f5 |
… |
2022/6/1 10:20 |
1072755 |
Search |
Android |
IE |
100001 |
… |
true |
false |
false |
true |
false |
… |
2022/6/1 12:12 |
1078030 |
Browse |
IOS |
Safari |
100002 |
… |
false |
false |
true |
true |
true |
… |
2022/6/1 12:36 |
1005093 |
Submit |
Android |
Chrome |
100003 |
… |
true |
true |
true |
false |
false |
… |
2022/6/1 13:21 |
1048655 |
Login |
Windows |
Chrome |
… |
false |
false |
true |
true |
true |
… |
|
2022/6/1 14:46 |
1037824 |
Logout |
Android |
Edge |
… |
false |
false |
false |
true |
true |
… |
|
2022/6/1 15:19 |
1049626 |
AddtoCart |
Windows |
Edge |
100004 |
… |
true |
true |
false |
true |
false |
… |
2022/6/1 16:00 |
1009296 |
Submit |
IOS |
Firefox |
100005 |
… |
false |
true |
false |
false |
true |
… |
2022/6/1 16:39 |
1070713 |
Browse |
IOS |
Sogou |
100006 |
… |
true |
true |
true |
false |
false |
… |
2022/6/1 17:40 |
1090884 |
Search |
Windows |
IE |
100007 |
… |
true |
false |
true |
true |
false |
… |
Fields in table T:
Field name |
Data type |
Description |
Time |
Datetime |
Time stamp of an event, accurate to milliseconds |
UserID |
String |
User ID |
EventType |
String |
Event type, whose value is Login, Browse, Search, AddtoCart, Submit or Logout |
OS |
String |
Operating system, whose value is Android, IOS, Windows or Unknown |
Browser |
String |
Browser, whose value is IE, Safari, Edge, Firefox, Chrome, Sogou or Unknown |
ProductID |
String |
Product ID, whose value is the ProductID field of dimension table Product |
… |
String |
Other fields that have enumerated values |
f1 |
Boolean |
Whether it is an offsite event or not; value is true or false |
f2 |
Boolean |
Whether it is a usual device or not; value is true or false |
f3 |
Boolean |
Whether it is a usual browser or not; value is true or false |
f4 |
Boolean |
Whether it is a cell phone or not; value is true or false |
f5 |
Boolean |
Whether it is the first operation; value is true or false |
… |
Boolean |
Other fields that have Boolean values |
Dimension table Product:
ProductID |
ProductName |
Unit |
Price |
ProductType |
100001 |
Apple |
Pound |
5.5 |
Fruits |
100002 |
Tissue |
Packs |
16 |
Home&Personalcare |
100003 |
Beef |
Pound |
35 |
Meat |
100004 |
Wine |
Bottles |
120 |
Beverage |
100005 |
Pork |
Pound |
25 |
Meat |
100006 |
Bread |
Packs |
10 |
Bakery |
100007 |
Juice |
Bottles |
6 |
Beverage |
… |
… |
… |
… |
… |
Fields in dimension table Product:
Field name |
Data type |
Description |
ProductID |
String |
Product ID |
ProductName |
String |
Product name |
Unit |
String |
Sales unit |
Price |
Numeric |
Unit price |
ProductType |
Integer |
Product type |
Computing task:
Count users who are not newcomers on a local Android or IOS system using Safari, Edge or Chrome and who perform the first N of a series of events (search, add to cart and submit order) in order under the product type Home & Personal care in the past three months so that we can calculate the customer conversion rate and churn rate. Such a computing scenario is known as conversion funnel analysis.
Note that the time window is the recent three months and the latest data should be included. The other aspects are same as the circumstances in the conversion funnel analysis:
1. The three events should occur in the time order. Those that do not appear in order are not eligible.
2. The three events should happen under one user in a specified time window. Those that occur out of the time range are not included.
3. Begin timing at the occurrence of the first event. If the subsequent events occur in order within the time window, each event is recorded as 1; otherwise, it is recorded as 0. If the frequency of an event is 0, there is no need to scan the subsequent events.
Techniques involved
The historical data, according to explanations in the previous articles, can be exported to T.ctx. And the latest generated real-time data can be appended to the in-memory patch zone through append@y() function to engage in the real-time analysis and computation. The new data won’t be written to the composite table file and thus won’t disturb the regular data store and appen to the composite table file.
All pseudo tables, composite tables and multizone composite tables support append@y() function.
Sample code
1. As we do in the previous articles, we store historical data in composite table file T.ctx. The data is stored in different zone tables by yearmonth to form a multizone composite table.
2. Define a pseudo table based on the multizone composite table using the same code provided in the previous article.
A |
|
1 |
=to(2021,2022).conj((a=~*100,12.(~+a))) |
2 |
=T("Product.btx").keys@i(ProductID) |
3 |
=[{file:"T.ctx", zone:A1, user:"UserID", date:"Time", column:[ {name:"Month",exp:"month@y(Time)"}, {name:"EventType",pseudo:"EventTypeName",enum:["Login","Browse","Search","AddtoCart","Submit","Logout"]}, {name:"OS",pseudo:"OSName",enum:["Android","IOS","Windows","Unknown"]}, {name:"Browser",pseudo:"BrowserName",enum:["IE","Safari","Edge","Firefox","Chrome","Sogou","Unknown"]}, {name:"b1",bits:["f1","f2","f3","f4","f5"]}, {name:"ProductID",dim:A2}] }] |
4 |
=pseudo(A5) |
3. Retrieve the newly-increased, real-time data, sort it by UserID and Time and append it to the multizone composite table’s patch zone using append@y() function.
Suppose in each day the newly-generated data in the previous day is appended to the multizone composite table file after 24:00. In this case. the real-time, new data is the data generated after 00:00 in each morning.
A |
|
… |
/ The above pseudo table definition code |
5 |
=connect("demo").cursor@x("select * from T where Time>=? order by UserID,Time",date(now()-1)) |
6 |
=A4.append@y(A7) |
A5 Connect to the database, retrieve the newly-generated data from table T to generate a cursor while sorting the cursor data by UserID and Time.
A6 Append data in the cursor to the multizone composite table’s in-memory patch zone through the pseudo table.
Beside using the pseudo table, we can also directly open the multizone composite table and append the new data to it using append@y(). To do this, we need to first convert each enumerated field to ordinal numbers and each binary field to bit-based dimension.
4. Summarize data using the pseudo table using the same code provided in the previous article. SPL will automatically merge the real-time data stored in the memory and the historical data held in the composite table file and export the merged data in an appropriate way.
A |
B |
C |
D |
|
… |
/ The above pseudo table definition code |
|||
5 |
>start =elapse@m(now(),-3),tw=7 |
|||
6 |
[Search,AddtoCart,Submit] |
=A6.(0) |
||
7 |
=A4.cursor(UserID, EventType,Time;Time>=start && A6.contain(EventType) && ProductID.ProductType=="Home&Personalcare"&& ["Safari","Edge","Chrome"].pos(BrowserName) && ["Android","IOS"].pos(OSName) && ! f1 && f4 && !f5) |
|||
8 |
for A7;UserID |
=first=A8.select@1(EventType==A6(1)) |
||
9 |
if(B8==null) |
next |
||
10 |
=t=null |
=A6.(null) |
||
11 |
for A6 |
if #B11==1 |
>C10(1)=t=t1=first.Time |
|
12 |
else |
>C10(#B11)=t=if(t,A8.select@1(EventType==B11 && Time>t && Time<elapse(t1,tw)).Time,null) |
||
13 |
=C10.(if(~,1,0)) |
|||
14 |
>D6=D6++B13 |
|||
15 |
return D6 |
5. When the multizone composite table does not have a bi-dimension ordering structure, we handle data in it in the same way – append data to the composite table’s in-memory patch zone using append@y function – without changing the code.
Execution result:
Member |
393400 |
257539 |
83375 |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version