User Behavior Analysis in Practice 4: Using Column-wise Storage


Target task:

We have a user events table T. Below is its structure and part of its data:

| Time | UserID | EventTypeID | EventType | Product | Quantity | Unit | Price |
|---|---|---|---|---|---|---|---|
| 2022/6/1 10:20 | 1072755 | 3 | Search | Apple | 5 | Pound | 5.5 |
| 2022/6/1 12:12 | 1078030 | 2 | Browse | Tissue | 4 | Packs | 16 |
| 2022/6/1 12:36 | 1005093 | 5 | Submit | Beef | 3 | Pound | 35 |
| 2022/6/1 13:21 | 1048655 | 1 | Login | | | | |
| 2022/6/1 14:46 | 1037824 | 6 | Logout | | | | |
| 2022/6/1 15:19 | 1049626 | 4 | AddtoCart | Wine | 4 | Bottles | 120 |
| 2022/6/1 16:00 | 1009296 | 5 | Submit | Pork | 6 | Pound | 25 |
| 2022/6/1 16:39 | 1070713 | 2 | Browse | Bread | 3 | Packs | 10 |
| 2022/6/1 17:40 | 1090884 | 3 | Search | Juice | 6 | Bottles | 6 |

Fields in table T:

| Field name | Data type | Description |
|---|---|---|
| Time | Datetime | Time stamp of an event, accurate to milliseconds |
| UserID | Integer | User ID |
| EventTypeID | Integer | Event type ID |
| EventType | String | Event type name |
| Product | String | Product name |
| Quantity | Numeric | Quantity |
| Unit | String | Unit |
| Price | Numeric | Price |
| …… | …… | …… |

Computing task:

For each event type, count the events and the distinct users who performed that type of event within a specified time period.

The first step is therefore to retrieve the data falling within the specified time period from a huge volume of data.

Techniques involved:

1. Column-wise storage

The user events table has many fields, but the analysis involves only a few of them. With a composite table stored in column-wise format, we only need to read the involved fields, which significantly speeds up retrieval and reduces memory usage.

A table stored column-wise generally also compresses well, which further reduces disk access.
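To make the idea concrete, here is a minimal Python sketch of a toy column store (illustrative only, not esProc's implementation): each field lives as its own array, so a query touching 3 of the 8 fields reads only those three.

```python
# A toy column store: each field of table T is kept as a separate list.
# A query that needs only a few fields touches only those lists; the
# remaining columns are never read at all.

events = {
    "Time":        ["2022/6/1 10:20", "2022/6/1 12:12", "2022/6/1 12:36"],
    "UserID":      [1072755, 1078030, 1005093],
    "EventTypeID": [3, 2, 5],
    "EventType":   ["Search", "Browse", "Submit"],
    "Product":     ["Apple", "Tissue", "Beef"],
    "Quantity":    [5, 4, 3],
    "Unit":        ["Pound", "Packs", "Pound"],
    "Price":       [5.5, 16, 35],
}

def read_columns(store, fields):
    """Return only the requested columns; the other columns stay unread."""
    return {f: store[f] for f in fields}

# The analysis needs just 3 of the 8 fields:
needed = read_columns(events, ["UserID", "EventTypeID", "EventType"])
```

With row-wise storage, every row (hence every field) would have to be read and parsed before the unused fields could be discarded; the column layout avoids that work entirely.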

2. Order-based storage

This computing task retrieves and summarizes data within a specified time range. Using Time as the composite table's dimension and storing data in its order greatly increases filtering speed.

A composite table stores data in blocks. Each block records the maximum and minimum values of the dimension field within it (which we call the minmax index). A filtering condition on the dimension field can thus quickly identify and skip the ineligible blocks.

Order-based storage also raises the compression ratio further, which decreases disk access even more.
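The block-skipping mechanism can be sketched in a few lines of Python (an illustration of the minmax idea, not esProc's actual storage format): because the data is ordered by the dimension, a range filter can reject whole blocks by comparing against their stored bounds.

```python
# Min-max block skipping: data is stored in blocks ordered by the
# dimension field; each block records its min and max. A range filter
# compares against these bounds and skips blocks that cannot match.

blocks = [
    {"min": 100, "max": 199, "rows": list(range(100, 200))},
    {"min": 200, "max": 299, "rows": list(range(200, 300))},
    {"min": 300, "max": 399, "rows": list(range(300, 400))},
]

def range_scan(blocks, lo, hi):
    hits, scanned = [], 0
    for b in blocks:
        # minmax index: reject the block without reading its rows
        if b["max"] < lo or b["min"] > hi:
            continue
        scanned += 1
        hits += [v for v in b["rows"] if lo <= v <= hi]
    return hits, scanned

hits, scanned = range_scan(blocks, 250, 280)  # only 1 of 3 blocks is read
```

On an ordered table, most blocks fall entirely outside the queried range, so the scan cost is proportional to the size of the result rather than the size of the table.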

3. Pre-filtering before retrieval with cursor

Generating data objects takes time. By evaluating the filtering condition before records are retrieved from the cursor, we can avoid generating objects for ineligible data.
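The difference between pre-filtering and post-filtering can be demonstrated with a small Python sketch (illustrative only; the counter shows how many record objects each approach constructs):

```python
# Filtering before vs. after record objects are built. Pre-filtering
# tests the raw values first and constructs objects only for eligible
# rows; post-filtering builds an object for every row, then discards.

from dataclasses import dataclass

@dataclass
class Event:
    time: int
    user: int

built = 0

def make_event(t, u):
    global built
    built += 1            # count how many objects get created
    return Event(t, u)

raw = [(t, 1000 + t) for t in range(100)]

# Post-filter (like cursor().select()): build all 100 objects, keep 10.
built = 0
post = [e for e in (make_event(t, u) for t, u in raw) if 40 <= e.time < 50]
built_post = built

# Pre-filter (like a condition inside cursor()): test first, build 10.
built = 0
pre = [make_event(t, u) for t, u in raw if 40 <= t < 50]
built_pre = built
```

Both approaches return the same records, but the pre-filtering version only pays the object-construction cost for rows that pass the condition.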

Sample code

1. Dump data from the database into a composite table file, stored column-wise and ordered by Time.

Existing data: the data is retrieved from the database, sorted by Time, and written to a composite table file in column-wise format.


|   | A |
|---|---|
| 1 | =connect("demo").cursor@x("select * from T order by Time") |
| 2 | =file("T.ctx").create@y(#Time, UserID, EventTypeID, EventType, Product, Quantity, Unit, Price, ……) |
| 3 | =A2.append(A1) |
| 4 | >A2.close() |

A1 Retrieves data from table T, sorting it by Time during retrieval.

A2 Generates composite table file T.ctx. create() defines the composite table's structure, where #Time means the Time field acts as the dimension and data is stored in its order. Generally, ordered dimension fields should be written before all other fields (there are cases where the dimension fields do not come first, but for now we take it that they must head the field list). Unlike a bin file, a composite table's structure must be declared explicitly when it is created.

A3 Fetches A1's data from the cursor and appends it to composite table file T.ctx. The appended data must already be ordered (hence the sort in A1), as writing to a composite table does not check orderliness.

A4 Closes A2's composite table. This is necessary after finishing writing data to a composite table.

Newly-added data: new data is identified by time stamp. Each day shortly after midnight, we append the data generated during the previous day to the composite table file:


|   | A |
|---|---|
| 1 | =connect("demo").cursor@x("select * from T where Time>=? && Time<? order by Time", date(now()-1), date(now())) |
| 2 | =file("T.ctx").open() |
| 3 | =A2.append(A1) |
| 4 | >A2.close() |

A1 Creates a cursor over the data generated during the previous day through the filtering condition; the data is sorted by time stamp at retrieval.

A2 Opens the composite table file.

A3 Fetches A1's data from the cursor and appends it to composite table file T.ctx. Since new data is identified by time stamp, it is necessarily generated later than the existing data, so there is no need to sort-merge the new data with the old; simply append the new data to the original file.

A4 Closes the composite table.

2. Perform filtering on the composite table and then grouping & aggregation

Suppose we need to summarize the data falling between 2022-03-15 and 2022-06-16:


|   | A |
|---|---|
| 1 | >start=date("2022-03-15","yyyy-MM-dd"), end=date("2022-06-16","yyyy-MM-dd") |
| 2 | =file("T.ctx").open().cursor(UserID, EventTypeID, EventType; Time>=start && Time<=end).groups(EventTypeID; EventType, count(1):Num, icount(UserID):iNum) |

A2 Retrieves the data within the specified time range from the composite table file, then groups and summarizes it. As no data writing is involved here, closing the composite table is unnecessary.

Since only a few fields participate in the computation in this case, they are listed in the cursor function to avoid retrieving all fields. Note that fields used only in filtering, and not in the subsequent grouping and aggregation, need not be selected.

The filtering condition should be written inside cursor(). The condition is then evaluated before a result record is generated (and the block-skipping action described above is performed automatically by comparing against the dimension field), so ineligible data is skipped without creating objects for it. By contrast, cursor().select(…) first generates a result set and then filters it, so objects are created even for ineligible data.
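The groups() aggregation above, an event count plus a distinct-user count (icount) per event type, can be expressed directly in a short Python sketch (illustrative only):

```python
# Per-type aggregation over already time-filtered rows: Num counts the
# events, and a set of user IDs gives the distinct-user count (iNum),
# mirroring count(1) and icount(UserID) in the groups() call.

from collections import defaultdict

rows = [  # (EventTypeID, EventType, UserID)
    (3, "Search", 1072755),
    (2, "Browse", 1078030),
    (3, "Search", 1072755),   # the same user searches again
    (2, "Browse", 1070713),
]

agg = defaultdict(lambda: {"Num": 0, "users": set()})
for type_id, type_name, user in rows:
    g = agg[(type_id, type_name)]
    g["Num"] += 1
    g["users"].add(user)

result = sorted(
    (tid, name, g["Num"], len(g["users"]))
    for (tid, name), g in agg.items()
)
# Browse: 2 events by 2 distinct users; Search: 2 events by 1 user
```

Note that the distinct count has to track the set of users seen per group, which is why icount is a more expensive aggregation than a plain count.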

Execution result:

| EventTypeID | EventType | Num | iNum |
|---|---|---|---|
| 1 | Login | 4033000 | 393400 |
| 2 | Browse | 3578901 | 348791 |
| 3 | Search | 2947931 | 257539 |
| 4 | AddtoCart | 1845674 | 175476 |
| 5 | Submit | 867345 | 83375 |
| 6 | Logout | 4033000 | 393400 |

3. Use parallel processing to further boost performance


|   | A |
|---|---|
| 1 | >start=date("2022-03-15","yyyy-MM-dd"), end=date("2022-06-16","yyyy-MM-dd") |
| 2 | =file("T.ctx").open().cursor@m(UserID, EventTypeID, EventType; Time>=start && Time<=end; 2).groups(EventTypeID; EventType, count(1):Num, icount(UserID):iNum) |

A2 uses cursor@m(...;...;n) to implement parallel processing. The @m option generates a multicursor for multithreaded processing, and n is the number of threads the cursor is split into, which should generally not exceed the number of CPU cores. When n is absent, the parallel number in the system configuration is used.
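What a 2-way multicursor does can be sketched in Python (an illustration of the split-aggregate-merge pattern, not esProc's threading model): split the data into segments, aggregate each segment in its own worker, then merge the partial results.

```python
# Parallel aggregation in the multicursor style: split the rows into
# two segments, count events per type in each segment concurrently,
# then merge the per-segment counters into the final result.

from collections import Counter
from concurrent.futures import ThreadPoolExecutor

rows = [("Search", 1), ("Browse", 2), ("Search", 3), ("Submit", 4)] * 1000

def aggregate(segment):
    return Counter(event_type for event_type, _ in segment)

mid = len(rows) // 2
segments = [rows[:mid], rows[mid:]]   # n = 2 "threads"

with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(aggregate, segments))

total = partials[0] + partials[1]     # merge the partial aggregations
```

Because counts from disjoint segments simply add up, the merge step is cheap; the same pattern applies to the grouping and aggregation in A2, with each thread handling one segment of the composite table.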

Execution result:

| EventTypeID | EventType | Num | iNum |
|---|---|---|---|
| 1 | Login | 4033000 | 393400 |
| 2 | Browse | 3578901 | 348791 |
| 3 | Search | 2947931 | 257539 |
| 4 | AddtoCart | 1845674 | 175476 |
| 5 | Submit | 867345 | 83375 |
| 6 | Logout | 4033000 | 393400 |