User Behavior Analysis in Practice 10: Ordered Storage by Account
Target task
We have a user events table T. Below is its structure and part of its data:
|  | Time | UserID | EventType | … |
| --- | --- | --- | --- | --- |
|  | 2022/6/1 10:20 | 1072755 | Search | … |
|  | 2022/6/1 12:12 | 1078030 | Browse | … |
|  | 2022/6/1 12:36 | 1005093 | Submit | … |
|  | 2022/6/1 13:21 | 1048655 | Login | … |
|  | 2022/6/1 14:46 | 1037824 | Logout | … |
|  | 2022/6/1 15:19 | 1049626 | AddtoCart | … |
|  | 2022/6/1 16:00 | 1009296 | Submit | … |
|  | 2022/6/1 16:39 | 1070713 | Browse | … |
|  | 2022/6/1 17:40 | 1090884 | Search | … |
|  | … | … | … | … |
Fields in table T:
| Field name | Data type | Description |
| --- | --- | --- |
| Time | Datetime | Time stamp of an event, accurate to milliseconds |
| UserID | String | User ID |
| EventType | String | Event type |
| … | … | … |
Computing task:
Find the number of occurrences and the number of distinct users for each type of event within a specified time period.
The key point is that there are too many distinct users for the result set to fit into memory.
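For contrast, consider the naive approach: keep one hash set of user IDs per event type. It is exactly these sets that overflow memory when the distinct users number in the hundreds of millions. A minimal Java sketch of the naive approach, where the Event record and the tiny in-memory stream are hypothetical stand-ins for table T:

```java
import java.util.*;
import java.util.stream.*;

public class NaiveDistinctCount {
    // Hypothetical stand-in for one row of table T
    record Event(String userId, String eventType) {}

    public static void main(String[] args) {
        Stream<Event> events = Stream.of(
                new Event("1072755", "Search"),
                new Event("1078030", "Browse"),
                new Event("1072755", "Search"));

        // One HashSet per event type: every distinct UserID must be held
        // in memory at the same time -- this is what overflows when the
        // number of distinct users is huge.
        Map<String, Long> counts = new HashMap<>();
        Map<String, Set<String>> users = new HashMap<>();
        events.forEach(e -> {
            counts.merge(e.eventType(), 1L, Long::sum);
            users.computeIfAbsent(e.eventType(), k -> new HashSet<>())
                 .add(e.userId());
        });
        counts.forEach((type, num) -> System.out.println(
                type + "  Num=" + num + "  iNum=" + users.get(type).size()));
    }
}
```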
Techniques involved:
Detailed explanations of order-based distinct and order-based increment can be found in SQL Performance Enhancement: DISTINCT & COUNT(DISTINCT) on Big Data and Ordered Storage of SPL.
1. Sort data by account and perform a distinct operation
Sort table T by user ID and save it as ordered data. This enables a fast distinct operation: to perform a distinct count on an ordered field, we only need to compare each record with the previous one during the traversal, without retaining the result set in memory. It is fast and won't cause memory overflow. A sketch of the principle follows.
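To illustrate the principle (this is a sketch of the algorithm, not SPL's internal code), here is a minimal Java version of an ordered distinct count; since the IDs arrive sorted, a single comparison with the previous value replaces the whole hash set, so memory use stays constant:

```java
import java.util.Iterator;
import java.util.List;

public class OrderedDistinct {
    // Count distinct values in an iterator that is already sorted by value.
    // Only the previous value is retained, so memory use is O(1).
    static long icountOrdered(Iterator<String> sortedIds) {
        long distinct = 0;
        String prev = null;
        while (sortedIds.hasNext()) {
            String cur = sortedIds.next();
            if (!cur.equals(prev)) {   // new value => one more distinct user
                distinct++;
                prev = cur;
            }
        }
        return distinct;
    }

    public static void main(String[] args) {
        List<String> ids = List.of("1005093", "1005093", "1009296", "1072755");
        System.out.println(icountOrdered(ids.iterator())); // prints 3
    }
}
```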
2. External sorting
When the volume of data is too large to be wholly loaded into memory, sorting it in the database is slow. An alternative is to read the data into a bin file, sort it in esProc, and save the result as a composite table file.
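For reference, the idea behind an external sort can be sketched in Java, assuming a plain text file with one value per line: sort fixed-size chunks in memory, spill each chunk to a temporary run file, and finally merge the runs with a priority queue. SPL's sortx works along these chunk-and-merge lines, but the code below is only an illustration of the principle:

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class ExternalSort {
    // Sort a file too large for memory: sort chunks in memory, spill each
    // to a temporary run file, then k-way merge the runs.
    public static void sort(Path in, Path out, int chunkRows) throws IOException {
        List<Path> runs = new ArrayList<>();
        try (BufferedReader r = Files.newBufferedReader(in)) {
            List<String> chunk = new ArrayList<>(chunkRows);
            String line;
            while ((line = r.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == chunkRows) runs.add(spill(chunk));
            }
            if (!chunk.isEmpty()) runs.add(spill(chunk));
        }
        mergeRuns(runs, out);
    }

    // Sort one chunk in memory, write it out as a run file, clear the chunk.
    private static Path spill(List<String> chunk) throws IOException {
        Collections.sort(chunk);
        Path run = Files.createTempFile("run", ".txt");
        Files.write(run, chunk);
        chunk.clear();
        return run;
    }

    // Merge the sorted runs: always emit the smallest head among all runs.
    private static void mergeRuns(List<Path> runs, Path out) throws IOException {
        record Head(String value, BufferedReader src) {}
        PriorityQueue<Head> pq = new PriorityQueue<>(Comparator.comparing(Head::value));
        for (Path run : runs) {
            BufferedReader r = Files.newBufferedReader(run);
            String v = r.readLine();
            if (v != null) pq.add(new Head(v, r));
        }
        try (BufferedWriter w = Files.newBufferedWriter(out)) {
            while (!pq.isEmpty()) {
                Head h = pq.poll();
                w.write(h.value()); w.newLine();
                String next = h.src().readLine();
                if (next != null) pq.add(new Head(next, h.src()));
                else h.src().close();
            }
        }
    }
}
```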
3. Order-based increment
Usually the newly-generated data isn't ordered by the account field, so it cannot be appended directly to the original ordered data. On the other hand, it is time-consuming to combine the new data with the old ordered data and re-sort the whole.
The composite table's patch table is SPL's solution. We maintain a small ordered data area called the patch table, and merge the sorted newly-generated data into it while keeping the original composite table unchanged. After a certain period, when the patch table has grown to a certain size, we merge it into the original composite table. To perform a distinct operation, we retrieve data from both the original composite table and the patch table, merging the two ordered streams as we traverse them (a sketch of this merge follows below). The performance is slightly lower than with a single ordered table, but the orderliness still enables a fast distinct operation.
When to merge the patch table into the original composite table depends on how often new data arrives. If new data is generated every day, we can perform the merge once a month. During that month the patch table holds only the past month's data while the original composite table holds all earlier data, so the former is much smaller than the latter. The amount of data merged each day is therefore small and the daily append is fast. The order-based merge of the full data happens only once a month, so even a relatively long merge is acceptable.
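Reading base table plus patch table boils down to a two-way ordered merge: both inputs are sorted by UserID, so a single pass yields one ordered stream on which the compare-with-previous distinct count shown earlier still works. A minimal Java sketch, where the two iterators are hypothetical stand-ins for the composite table and patch table cursors:

```java
import java.util.*;

public class OrderedMerge {
    // Merge two ascending-sorted iterators (the base composite table and
    // the small patch table) into one ascending-sorted stream.
    static Iterator<String> merge(Iterator<String> base, Iterator<String> patch) {
        return new Iterator<>() {
            String a = advance(base), b = advance(patch);
            String advance(Iterator<String> it) {
                return it.hasNext() ? it.next() : null;
            }
            public boolean hasNext() { return a != null || b != null; }
            public String next() {
                if (b == null || (a != null && a.compareTo(b) <= 0)) {
                    String out = a; a = advance(base); return out;  // from base
                }
                String out = b; b = advance(patch); return out;     // from patch
            }
        };
    }

    public static void main(String[] args) {
        Iterator<String> merged = merge(
                List.of("1005093", "1048655").iterator(),
                List.of("1009296", "1090884").iterator());
        merged.forEachRemaining(System.out::println); // prints in ascending order
    }
}
```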
4. Parallel processing
Specify segmenting by UserID when creating the composite table, so that records with the same UserID are put into the same segment. Then generate a multicursor on the composite table and use parallel processing to summarize the data faster, as sketched below.
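The sketch below (in Java, with hypothetical in-memory segments standing in for composite table segments) shows why segmentation must respect UserID: each thread counts distinct users within its own sorted segment, and the per-segment counts can simply be summed only because no UserID spans two segments:

```java
import java.util.*;
import java.util.concurrent.*;

public class SegmentedParallelCount {
    public static void main(String[] args) throws Exception {
        // Each segment is a sorted slice; the same user never appears in
        // two segments, mirroring segmentation by UserID.
        List<List<String>> segments = List.of(
                List.of("1005093", "1005093", "1009296"),  // segment 1
                List.of("1048655", "1072755", "1072755")); // segment 2

        ExecutorService pool = Executors.newFixedThreadPool(segments.size());
        List<Future<Long>> parts = new ArrayList<>();
        for (List<String> seg : segments)
            parts.add(pool.submit(() -> distinctInSortedSegment(seg)));

        long total = 0;
        for (Future<Long> p : parts)
            total += p.get();  // summing is safe only with key-aligned segments
        pool.shutdown();
        System.out.println("distinct users = " + total); // prints 4
    }

    // Ordered distinct count within one segment, as in the earlier sketch.
    static long distinctInSortedSegment(List<String> seg) {
        long n = 0; String prev = null;
        for (String id : seg)
            if (!id.equals(prev)) { n++; prev = id; }
        return n;
    }
}
```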
Sample code
1. Basic method of sorting table T by UserID and dumping it:
|  | A |
| --- | --- |
| 1 | =connect("demo").cursor@x("select * from T order by UserID,Time") |
| 2 | =file("T.ctx").create@p(#UserID,#Time,EventType,…) |
| 3 | =A2.append(A1) |
| 4 | >A2.close() |
A1 Imports the fact table, sorted by UserID and Time.
A2 Creates the composite table file's structure. The @p option makes the composite table segmented by its first field (UserID): records with the same UserID will never be given to different threads during parallel processing, which ensures a correct result. Without this option, the table is simply segmented by record count, and records with the same UserID could be allocated to two threads, causing errors in an order-based distinct aggregate. The previous articles in this series did not use the order-based distinct aggregate algorithm, so they had no special segmentation requirement and did not need the @p option for correct parallel computation.
A3 Exports the fact table's data and appends it to A2's composite table file.
2. External sorting
If the volume of data is too large to fit into memory and a hard-disk buffer is needed for sorting, database sorting will be extremely slow. A better alternative is to export the data to a bin file, sort it with esProc and save the result as a composite table file, as the following code shows:
|  | A |
| --- | --- |
| 1 | =connect("demo").cursor@x("select * from T") |
| 2 | =file("T.btx").export@b(A1) |
| 3 | =file("T.ctx").create@p(#UserID,#Time,EventType,…) |
| 4 | =file("T.btx").cursor@b(UserID,Time,EventType,…).sortx(UserID,Time;1000000) |
| 5 | =A3.append(A4) |
| 6 | >A3.close() |
A1 Imports the fact table without sorting it.
A2 Exports the fact table's data to a bin file.
A3 Creates the composite table's structure.
A4 Imports the bin file's data as a cursor and sorts it by UserID and Time. The parameter n in sortx(…;n) specifies the number of buffer rows. If a suitable number is hard to determine, the parameter can be omitted and sortx will automatically calculate a reasonably suitable value. It is generally recommended that the buffer occupy no more than half of the available memory, because heavier memory usage slows down the Java virtual machine; within that limit, the larger the buffer, the faster the sort.
A5 Appends the sorted data to the composite table file.
3. Order-based increment
The increased data: we use the time stamp to identify the newly-increased data. Every day shortly after 0:00, the previous day's data is appended to the composite table file:
|  | A | B |
| --- | --- | --- |
| 1 | =connect("demo").cursor@x("select * from T where Time>=? && Time<? order by UserID,Time",date(now()-1), date(now())) |  |
| 2 | =file("T.ctx").open() |  |
| 3 | =A2.append@a(A1) |  |
| 4 | if day(now())==1 | >A2.reset() |
| 5 | >A2.close() |  |
A1 Imports the newly-increased data from the fact table. As the increment is small, we sort it directly by UserID and Time in the SQL.
A2 Opens the composite table file.
A3 Appends the increased data to the composite table file's patch table; the @a option ensures the data is written to the patch table by appending.
A4-B4 Check whether the current date is the first day of the month; if it is, merge the composite table with the patch table and empty the patch table, using the reset() function.
Since the @p option was used when the composite table was created, SPL automatically puts the newly-appended data into the corresponding segments, so no error occurs in subsequent parallel computations.
4. Count distinct users based on the composite table ordered by UserID
|  | A |
| --- | --- |
| 1 | >start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd") |
| 2 | =file("T.ctx").open().cursor@m(UserID, EventType;Time>=start && Time<=end;2).groups(EventType; count(1):Num, icount@o(UserID):iNum) |
A2 Filters table T in the cursor and performs grouping and aggregation. The @o option of icount@o() means the data is ordered by the distinct field. Since the @p option was used in the data preparation phase, the @m option is enough here to implement a correct parallel computation.
The unsolved problem: before performing the distinct count, the data also needs to be filtered by the Time field. Yet the data has been sorted by UserID for the distinct count, so Time is not the first ordering field. The filter on Time therefore cannot be executed fast; it can only work on the ordered data within each UserID. The data volume per UserID, however, is very small, generally less than one data block, so skip-block retrieval seldom happens and the effect is nearly equivalent to traversing all the data. In scenarios where the time span of the whole data is large but that of the queried period is small, this approach performs unsatisfactorily. We will introduce the SPL solution that handles both the distinct count on UserID and the filtering on Time in another essay (which focuses on the bi-dimension ordering structure).
Execution result:
| EventType | Num | iNum |
| --- | --- | --- |
| AddtoCart | 1845674 | 175476 |
| Browse | 3578901 | 348791 |
| Login | 4033000 | 393400 |
| Logout | 4033000 | 393400 |
| Search | 2947931 | 257539 |
| Submit | 867345 | 83375 |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL