User Behavior Analysis in Practice 6: Numberizing the Dimension Table

 

Target task:

We have a user events table T. Below is its structure and part of its data:

Time

UserID

EventTypeID

ProductID

Quantity

2022/6/1 10:20

1072755

3

100001


2022/6/1 12:12

1078030

2

100002


2022/6/1 12:36

1005093

5

100003

3

2022/6/1 13:21

1048655

1



2022/6/1 14:46

1037824

6



2022/6/1 15:19

1049626

4

100004

4

2022/6/1 16:00

1009296

5

100005

6

2022/6/1 16:39

1070713

2

100006


2022/6/1 17:40

1090884

3

100007


Fields in table T:

Field name

Data type

Description

Time

Datetime

Time stamp of an event, accurate to milliseconds

UserID

String

User ID

EventTypeID

Integer

Event type ID

ProductID

String

Product ID

Quantity

Numeric

Quantity

Dimension table EventType:

EventTypeID

EventType

1

Login

2

Browse

3

Search

4

AddtoCart

5

Submit

6

Logout

Dimension table Product:

ProductID

ProductName

Unit

Price

ProductTypeID

100001

Apple

Pound

5.5

1

100002

Tissue

Packs

16

2

100003

Beef

Pound

35

3

100004

Wine

Bottles

120

4

100005

Pork

Pound

25

3

100006

Bread

Packs

10

5

100007

Juice

Bottles

6

4

Fields in dimension table Product:

Field name

Data type

Description

ProductID

String

Product ID

ProductName

String

Product name

Unit

String

Sales unit

Price

Numeric

Unit prices

ProductTypeID

Integer

Product type ID

Dimension table ProductType:

ProductTypeID

ProductType

1

Fruits

2

Home&Personalcare

3

Meat

4

Beverage

5

Bakery

Relationship between tables:

Computing task:

Calculate the total sales amount, number of orders, search frequency and the number of distinct users performing search and ordering under each type of product within a specified time period.

Techniques involved:

1. Associate with dimension table using ordinal-number-based location.

In both EventType and ProductType tables, primary keys are ordinal natural numbers. They can be directly used to locate records and achieve associations without creating index and computing and comparing HASH values. This can boost performance.

2. Convert a dimension table’s primary key values that are non-ordinal-numbers into ordinal natural numbers so that ordinal-number-based location can be used to speed up association.

Product table’s primary key values are not ordinal numbers represented by natural numbers, but we can first transform them into ordinal numbers and, in the meantime, convert ProductID field values in user events table T into corresponding ordinal number. Now ordinal-number-based location can be used.

Sample code

The code has five parts:

1. According to practices in previous essays, we dump data in those dimension tables as bin files EventType.btx, Product.btx and ProductType.btx that are respectively ordered by EventTypeID, ProductType and ProductTypeID.

2. Join user events table T and EventType table, Product table and ProductType table through ordinal numbers.

3. Add an ordinal number field to Product table.

4. Dump data from user events table T to store it in composite table T.ctx ordered by Time field, and transform ProductID values into ordinal numbers of corresponding records in Product table.

5. Import each dimension table into memory, open the composite table cursor, establish associations with dimension tables during which the association with Product table is through ordinal numbers. It is not necessary to set primary key and create index for dimension table Product.

1. Dump dimension table data


A

1

=connect("demo")

2

=A1.query("select * from Product")

3

=file("Product.btx").export@b(A1)

4

=A1.query("select * from EventType order by EventTypeID")

5

=file("EventType.btx").export@b(A4)

6

=A1.query@x("select * from ProductType order by ProductTypeID")

7

=file("ProductType.btx").export@b(A6)

A4 Sort by EventTypID.

A6 Sort by ProductTypID.

2. Change types of association between EventType table and table T, ProductType table and table T into ordinal number-based association.


A

1

>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")

2

=file("T.ctx").open().cursor(UserID,EventTypeID,ProductID,Quantity;Time>=start && Time<=end && (EventTypeID==5 || EventTypeID==3))

3

>EventType=file("EventType.btx").import@b()

4

>ProductType=file("ProductType.btx").import@b()

5

>Product=file("Product.btx").import@b().keys@i(ProductID)

6

>Product=Product.switch(ProductTypeID, ProductType:#)

7

=A2.switch(ProductID,Product:ProductID;EventTypeID,EventType:#)

8

=A7.groups(EventTypeID,ProductID.ProductTypeID;EventTypeID.EventType,ProductID.ProductTypeID.ProductType,sum(Quantity):Quantity,count(1):Num, icount(UserID):iNum)

A3-A4 Do not set create indexes on primary keys for EventType and ProductType.

A6 Change type of association between ProductType table and table T into ordinal number-based association.

A7 Change type of association between EventType table and table T into ordinal number-based association.

3. Add an ordinal number field in Product table.

Original data: add an ordinal number field to it directly.


A

1

=connect("demo").query@x("select * from Product").derive(#:ProductNum)

2

= file("Product.btx").export@b(A1)

Updated data: Whenever a dimension table is updated, it is wholly retrieved and compared with the dumped btx file. The comparison is based on ordinal numbers of the dumped records, otherwise historical data will be mismatched. The newly-increased data is placed at the end. Usually there are no deletion actions on dimension tables. Any deletion of dimension data will cause errors when historical records of the fact table try to reference records of the dimension table.


A

1

.keys@i(ProductID)=connect("demo").query@x("select * from Product").derive(:ProductNum)

2

= file("Product.btx").import@b().keys@i(ProductID)

3

=A1.select(A2.find(A1.ProductID)==null)

4

=A2.(if(r=A1.find(A2.ProductID),r,~) )

5

=(A4|A3).run(ProductNum=#)

6

=file("Product.btx").export@b(A5)

A1 Load the updated dimension table, add ProductNum field and set primary key and index.

A2 Load the original dimension table Product from the corresponding bin file and set primary key.

A3 Find the newly-increased records in the updated dimension table.

A4 If a record in the original dimension table exist in the updated one, use the new record; if it does not exist, use the original record.

A5 Union A4 and A3 and set ordinal numbers. As A4 keeps the order of the original dimension table, ordinal numbers in the original dimension table are retained.

A6 Write A5’s result to a bin file.

4. The code of preparing file T.ctx, during which ProductID field values are changed into ordinal numbers.

Take stocked data as an example:


A

1

>Product=file("Product.btx").import@b().keys@i(ProductID)

2

=connect("demo").cursor@x("select * from T order by Time")

3

=A2.run(ProductID=Product.find(A2.ProductID).ProductNum)

4

=file("T.ctx").create@y(#Time,UserID,EventTypeID, ProductID, Quantity)

5

.append(A3)=A4

6

>A4.close()

A1 Load dimension table Product into the memory and create index on primary key.

A2 Sort table T by time while retrieving data from it.

A3 Replace ProductID field values in table T with ordinal numbers of corresponding records in the dimension table.

A4 Create a composite table.

A5 Append data of table T to A4’s composite table.

Similar code for the newly-increased data.

5. Perform the whole analysis on the converted data, all through ordinal numbers.

Suppose we need to summarize data that falls in between 2022-03-15 and 2022-06-16:


A

1

>EventType=file("EventType.btx").import@b()

2

>ProductType=file("ProductType.btx").import@b()

3

>Product=file("Product.btx").import@b()

4

>Product=Product.switch(ProductTypeID, ProductType:#)

5

=file("T.ctx").open().cursor(UserID,EventTypeID,ProductID,Quantity;Time>=start && Time<=end && (EventTypeID==5 || EventTypeID==3))

6

=A5.switch(ProductID,Product:#; EventTypeID,EventType:#)

7

=A6.groups(EventTypeID,ProductID.ProductTypeID;EventTypeID.EventType,ProductID.ProductTypeID.ProductType,sum(Quantity):Quantity,count(1):Num, icount(UserID):iNum)

A1-A3 Do not set primary keys.

A6 A join via ordinal numbers.

Execution result:

EventTypeID

ProductTypeID

EventType

ProductType

Quantity

Num

iNum

3

1

Search

Fruits

0

499586

48735

3

2

Search

Home&Personalcare

0

508897

49872

3

3

Search

Meat

0

403213

39923

3

4

Search

Beverage

0

324567

29045

3

5

Search

Bakery

0

335498

30234

5

1

Submit

Fruits

206938

103469

13523

5

2

Submit

Home&Personalcare

463188

154396

14656

5

3

Submit

Meat

94378

93366

8754

5

4

Submit

Beverage

217504

54376

5233

5

5

Submit

Bakery

339480

67896

5844