User Behavior Analysis in Practice 4: Using Column-wise Storage
Target task:
We have a user events table T. Below is its structure and part of its data:
| Time | UserID | EventTypeID | EventType | Product | Quantity | Unit | Price | … |
|---|---|---|---|---|---|---|---|---|
| 2022/6/1 10:20 | 1072755 | 3 | Search | Apple | 5 | Pound | 5.5 | … |
| 2022/6/1 12:12 | 1078030 | 2 | Browse | Tissue | 4 | Packs | 16 | … |
| 2022/6/1 12:36 | 1005093 | 5 | Submit | Beef | 3 | Pound | 35 | … |
| 2022/6/1 13:21 | 1048655 | 1 | Login | | | | | … |
| 2022/6/1 14:46 | 1037824 | 6 | Logout | | | | | … |
| 2022/6/1 15:19 | 1049626 | 4 | AddtoCart | Wine | 4 | Bottles | 120 | … |
| 2022/6/1 16:00 | 1009296 | 5 | Submit | Pork | 6 | Pound | 25 | … |
| 2022/6/1 16:39 | 1070713 | 2 | Browse | Bread | 3 | Packs | 10 | … |
| 2022/6/1 17:40 | 1090884 | 3 | Search | Juice | 6 | Bottles | 6 | … |
Fields in table T:
| Field name | Data type | Description |
|---|---|---|
| Time | Datetime | Time stamp of an event, accurate to milliseconds |
| UserID | Integer | User ID |
| EventTypeID | Integer | Event type ID |
| EventType | String | Event type name |
| Product | String | Product name |
| Quantity | Numeric | Quantity |
| Unit | String | Unit |
| Price | Numeric | Price |
| …… | …… | …… |
Computing task:
Find the number of events of each type, and the number of distinct users who performed events of that type, within a specified time period.
The first thing the task involves is obtaining the data that falls within the specified time period from a huge amount of data.
Techniques involved:
1. Column-wise storage
The user events table has many fields, but the analysis only involves a few of them. By using a composite table in column-wise storage format, we only need to read the fields involved, which significantly speeds up retrieval and reduces memory usage.
A table stored column-wise also generally compresses well, which helps to reduce disk access.
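As a minimal illustration (a sketch only; it assumes the composite table file T.ctx built in the sample code below already exists), listing the fields in cursor() ensures that only those columns are read from disk:

| | A | B |
|---|---|---|
| 1 | =file("T.ctx").open().cursor(UserID,EventTypeID,EventType) | //only the three listed columns are read; the table's other columns are never loaded |
| 2 | =A1.fetch(100) | //fetch the first 100 records for inspection |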
2. Order-based storage
This computing task retrieves and summarizes data within a specified time range. Using Time as the composite table's dimension and storing data in Time order greatly increases the filtering speed.
A composite table stores data in blocks, and each block records the maximum and minimum values of the dimension field within it (which we call the minmax index). A filtering condition on the dimension field can quickly identify and skip the ineligible blocks.
Order-based storage also increases the compression ratio further, which in turn decreases the amount of data read from disk.
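For instance (a sketch, again assuming T.ctx uses Time as its ordered dimension, as in the sample code below), a range condition on Time lets the minmax index skip whole blocks:

| | A | B |
|---|---|---|
| 1 | >start=date("2022-06-01","yyyy-MM-dd"), end=date("2022-06-02","yyyy-MM-dd") | |
| 2 | =file("T.ctx").open().cursor(UserID;Time>=start && Time<end) | //blocks whose minmax Time range falls entirely outside [start,end) are skipped without being read |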
3. Pre-filtering before retrieval with cursor
Generating record objects takes time. By evaluating the filtering condition before records are retrieved from the cursor, we can avoid generating objects for ineligible data.
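In SPL this means passing the condition as a parameter of cursor() instead of calling select() afterwards. A sketch, reusing start and end from the sketch above:

| | A | B |
|---|---|---|
| 1 | =file("T.ctx").open().cursor(UserID;Time>=start && Time<end) | //the condition is evaluated before record objects are generated |

Section 2 of the sample code below contrasts this with the slower cursor().select() style.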
Sample code
1. Dump data from the database and store it in a column-wise composite table file ordered by Time.
Stocked data: the existing historical data is retrieved from the database, sorted by Time, and written to a column-wise composite table file.
| | A |
|---|---|
| 1 | =connect("demo").cursor@x("select * from T order by Time") |
| 2 | =file("T.ctx").create@y(#Time,UserID,EventTypeID,EventType,Product,Quantity,Unit,Price,……) |
| 3 | =A2.append(A1) |
| 4 | >A2.close() |
A1 Retrieve data from table T, sorting it by Time during retrieval.
A2 Generate the composite table file T.ctx. create() creates the composite table's structure, where #Time means the Time field acts as the dimension and data is stored in its order. Generally, the ordered dimensions should be written before all the other fields (there are cases where the dimension fields do not take the lead, but for the time being we take it for granted that they head the field list). Unlike a bin file, a composite table's structure must be explicitly declared at creation.
A3 Fetch A1's data from the cursor and append it to the composite table file T.ctx. Make sure the data to be appended is already ordered (it is sorted in A1), as writing to a composite table does not check orderliness.
A4 Close A2's composite table. This is necessary after finishing writing data to a composite table.
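A quick check after the dump (a sketch, not part of the original routine): preview the first records to confirm the file opens and reads back in ascending Time order. No close() is needed here since nothing is written.

| | A |
|---|---|
| 1 | =file("T.ctx").open().cursor(Time,UserID,EventType).fetch(10) |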
Newly-increased data: new data is identified by its timestamp. Each day shortly after midnight, we append the data generated during the previous day to the composite table file:
| | A |
|---|---|
| 1 | =connect("demo").cursor@x("select * from T where Time>=? and Time<? order by Time", date(now()-1), date(now())) |
| 2 | =file("T.ctx").open() |
| 3 | =A2.append(A1) |
| 4 | >A2.close() |
A1 Create a cursor that gets the data generated in the previous day through the filtering condition. The data is sorted by timestamp first.
A2 Open the composite table file.
A3 Fetch A1's data from the cursor and append it to the composite table file T.ctx. As the newly-increased data is identified by timestamp, it is necessarily later than the existing data, so there is no need to sort-merge the new data with the old. Just append the new data to the original file.
A4 Close the composite table.
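If the daily job might be re-run, a defensive variant is to start from the last stored timestamp instead of a fixed one-day window. This is only a sketch; it assumes total(), which aggregates over a cursor, is an acceptable extra pass over the Time column:

| | A | B |
|---|---|---|
| 1 | =file("T.ctx").open() | |
| 2 | =A1.cursor(Time).total(max(Time)) | //the latest timestamp already stored in T.ctx (assumption: total() aggregates the cursor) |
| 3 | =connect("demo").cursor@x("select * from T where Time>? order by Time", A2) | //fetch only records newer than it |
| 4 | =A1.append(A3) | |
| 5 | >A1.close() | |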
2. Perform filtering on the composite table and then grouping & aggregation
Suppose we need to summarize the data that falls between 2022-03-15 and 2022-06-16:
| | A |
|---|---|
| 1 | >start=date("2022-03-15","yyyy-MM-dd"), end=date("2022-06-16","yyyy-MM-dd") |
| 2 | =file("T.ctx").open().cursor(UserID,EventTypeID,EventType;Time>=start && Time<=end).groups(EventTypeID; EventType, count(1):Num, icount(UserID):iNum) |
A2 Retrieve data within the specified time range from the composite table file, then group and summarize it. As no data is written here, the close() action on the composite table is not needed.
Since only a few fields participate in the computation in this case, list them in the cursor function to avoid retrieving all fields. Note that fields used only in the filtering condition, but not in the subsequent grouping & aggregation, do not need to be listed.
The filtering condition should be written inside cursor(). The condition then decides whether a result record is generated at all (with the block-skipping described earlier performed automatically by comparing against the dimension field), so ineligible data is skipped. By contrast, cursor().select(…) first generates a result set and then filters it, so record objects are generated even for ineligible data; a sketch contrasting the two styles follows the execution result below.
Execution result:
| EventTypeID | EventType | Num | iNum |
|---|---|---|---|
| 1 | Login | 4033000 | 393400 |
| 2 | Browse | 3578901 | 348791 |
| 3 | Search | 2947931 | 257539 |
| 4 | AddtoCart | 1845674 | 175476 |
| 5 | Submit | 867345 | 83375 |
| 6 | Logout | 4033000 | 393400 |
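To make the contrast concrete, here is a sketch of the two styles, with start and end as in A1 above. The first skips ineligible data early; the second must also retrieve Time, because select() evaluates the condition on already-generated records:

| | A | B |
|---|---|---|
| 1 | =file("T.ctx").open().cursor(UserID,EventTypeID,EventType;Time>=start && Time<=end) | //recommended: filtering inside cursor(), block skipping applies |
| 2 | =file("T.ctx").open().cursor(UserID,EventTypeID,EventType,Time).select(Time>=start && Time<=end) | //slower: record objects are generated even for ineligible rows |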
3. Use parallel processing to further boost performance
| | A |
|---|---|
| 1 | >start=date("2022-03-15","yyyy-MM-dd"), end=date("2022-06-16","yyyy-MM-dd") |
| 2 | =file("T.ctx").open().cursor@m(UserID,EventTypeID,EventType;Time>=start && Time<=end;2).groups(EventTypeID; EventType, count(1):Num, icount(UserID):iNum) |
A2 uses cursor@m(…;…;n) to perform parallel processing. The @m option generates a multicursor for multithreaded processing, and n is the number of threads the cursor is split into, which is generally less than the number of CPU cores. When n is absent, the parallel limit in the system configuration is used, as sketched after the execution result below.
Execution result:
| EventTypeID | EventType | Num | iNum |
|---|---|---|---|
| 1 | Login | 4033000 | 393400 |
| 2 | Browse | 3578901 | 348791 |
| 3 | Search | 2947931 | 257539 |
| 4 | AddtoCart | 1845674 | 175476 |
| 5 | Submit | 867345 | 83375 |
| 6 | Logout | 4033000 | 393400 |
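For completeness, a sketch of the same query with n omitted, with start and end as in A1 above; the number of threads then comes from the parallel limit in the system configuration:

| | A |
|---|---|
| 1 | =file("T.ctx").open().cursor@m(UserID,EventTypeID,EventType;Time>=start && Time<=end).groups(EventTypeID; EventType, count(1):Num, icount(UserID):iNum) |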
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL