User Behavior Analysis in Practice 5: Using Dimension Table

 

Target task:

We have a user events table T. Below is its structure and part of its data:

Time

UserID

EventTypeID

ProductID

Quantity

2022/6/1 10:20

1072755

3

100001


2022/6/1 12:12

1078030

2

100002


2022/6/1 12:36

1005093

5

100003

3

2022/6/1 13:21

1048655

1



2022/6/1 14:46

1037824

6



2022/6/1 15:19

1049626

4

100004

4

2022/6/1 16:00

1009296

5

100005

6

2022/6/1 16:39

1070713

2

100006


2022/6/1 17:40

1090884

3

100007


Fields in table T:

Field name

Data type

Description

Time

Datetime

Time stamp of an event, accurate to milliseconds

UserID

String

User ID

EventTypeID

Integer

Event type ID

ProductID

String

Product ID

Quantity

Numeric

Quantity

Dimension table EventType:

EventTypeID

EventType

1

Login

2

Browse

3

Search

4

AddtoCart

5

Submit

6

Logout

Dimension table Product:

ProductID

ProductName

Unit

Price

ProductTypeID

100001

Apple

Pound

5.5

1

100002

Tissue

Packs

16

2

100003

Beef

Pound

35

3

100004

Wine

Bottles

120

4

100005

Pork

Pound

25

3

100006

Bread

Packs

10

5

100007

Juice

Bottles

6

4

Fields in dimension table Product:

Field name

Data type

Description

ProductID

String

Product ID

ProductName

String

Product name

Unit

String

Sales unit

Price

Numeric

Unit price

ProductTypeID

Integer

Product type ID

Dimension table ProductType:

ProductTypeID

ProductType

1

Fruits

2

Home&Personalcare

3

Meat

4

Beverage

5

Bakery

Relationship between tables:

undefined

Computing task:

Calculate the total sales amount, number of orders, search frequency and the number of distinct users performing search and ordering under each type of product within a specified time period.

Techniques involved:

1 Join tables through dimension table rather than generating a wide table. This can reduce the volume of data to be stored and increase retrieval speed.

2 Use a global variable to pre-load dimension tables and establish associations for later reuse.

Sample code

1. According to our previous practices, we dump data from user events table T and store it in composite table T.ctx according to the order of Time field; and then dump data in those dimension tables as bin files EventType.btx, Product.btx and ProductType.btx.

2. Import each dimension table into memory, set primary key for them, open the composite table cursor, establish associations with dimension tables, and perform grouping &aggregation.

Suppose we need to summarize data that falls in between 2022-03-15 and 2022-06-16:


A

1

>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")

2

=file("T.ctx").open().cursor(UserID,EventTypeID,ProductID,Quantity;Time>=start && Time<=end && (EventTypeID==5 || EventTypeID==3))

3

>EventType=file("EventType.btx").import@b().keys@i(EventTypeID)

4

>ProductType=file("ProductType.btx").import@b().keys@i(ProductTypeID)

5

>Product=file("Product.btx").import@b().keys@i(ProductID)

6

>Product=Product.switch(ProductTypeID, ProductType:ProductTypeID)

7

=A2.switch(ProductID,Product:ProductID;EventTypeID,EventType:EventTypeID)

8

=A7.groups(EventTypeID,ProductID.ProductTypeID;EventTypeID.EventType,ProductID.ProductTypeID.ProductType,sum(Quantity):Quantity,count(1):Num, icount(UserID):iNum)

A2 Retrieve records within the specified time range and where the event type is “submit order” and “search” from the composite table file and create a cursor based on them.

A3 Load dimension data from bin file EventType.btx, and set primary key and create index on it.

A4 Load dimension data from bin file ProdcutType.btx and set ProductTypeID as its primary key.

A5 Load dimension data from bin file Prodcut.btx and set ProductID as the primary key.

A6 Establish association between Product and ProductType.

A7 Associate A2’s cursor with in-memory dimension tables Product and EventType respectively.

A8 Perform grouping calculation to generate small result sets from A7’s cursor, the joining result.

When using switch() function to join with a dimension table, you need to set primary key for the dimension table in advance and perform join operation through the key. The join amounts to adding references of dimension table records in the associated field of the original table. Then you can reference any field of the dimension table using the syntax "field of the original table.field of the dimension table".

With hierarchical dimension tables, like table T – Product table – ProductType table in this instance, you can use the dot operator (.) to reference levels of dimension tables one by one, such as "ProductID.ProductTypeID.ProductType". It means that table T’s ProductID field references ProductTypeID field in its dimension table and then ProductType field in the dimension table’s dimension table.

3. Dimension tables are often used repeatedly. As they are generally not large, we can load them into the memory, establish associations and store each of them as a global variable. Then there is no need to load dimension tables and establish associations again for summarizations and just use the global variable directly. This way the above code can be regarded as including two parts. The first part is to load dimension tables as global variables at the startup of the server. The second part is the code for performing summarization.

Part one (execute once at the startup of the server):


A

1

=file("EventType.btx").import@b().keys@i(EventTypeID)

2

=file("ProductType.btx").import@b().keys@i(ProductTypeID)

3

=file("Product.btx").import@b().keys@i(ProductID)

4

>env(EventType,A1),env(ProductType,A2),env(Product,A3)

5

>Product.switch(ProductTypeID,ProductType:ProductTypeID)

A4 Store each of the in-memory dimension tables as global variables for reference by the code for performing summarization.

Part two (summarization):


A

1

>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")

2

=file("T.ctx").open().cursor(UserID,EventTypeID,ProductID,Quantity;Time>=start && Time<=end && (EventTypeID==5 || EventTypeID==3))

3

=A2.switch(ProductID,Product:ProductID;EventTypeID,EventType:EventTypeID)

4

=A3.groups(EventTypeID,ProductID.ProductTypeID;EventTypeID.EventType,ProductID.ProductTypeID.ProductType,sum(Quantity):Quantity,count(1):Num, icount(UserID):iNum)

Execution result:

EventTypeID

ProductTypeID

EventType

ProductType

Quantity

Num

iNum

3

1

Search

Fruits

0

499586

48735

3

2

Search

Home&Personalcare

0

508897

49872

3

3

Search

Meat

0

403213

39923

3

4

Search

Beverage

0

324567

29045

3

5

Search

Bakery

0

335498

30234

5

1

Submit

Fruits

206938

103469

13523

5

2

Submit

Home&Personalcare

463188

154396

14656

5

3

Submit

Meat

94378

93366

8754

5

4

Submit

Beverage

217504

54376

5233

5

5

Submit

Bakery

339480

67896

5844