User Behavior Analysis in Practice 8: The Changing Dimension Table

 

Target task

We have a user events table T. Below is its structure and part of its data:

Time

UserID

ProductID

Quantity

2022/6/1 10:20

1072755

1

7

2022/6/1 12:12

1078030

2

8

2022/6/1 12:36

1005093

3

3

2022/6/1 13:21

1048655

4

9

2022/6/1 14:46

1037824

5

5

2022/6/1 15:19

1049626

6

4

2022/6/1 16:00

1009296

7

6

2022/6/1 16:39

1070713

8

7

2022/6/1 17:40

1090884

9

4

Fields in table T:

Field name

Data type

Description

Time

Datetime

Time stamp of an event, accurate to milliseconds

UserID

String

User ID

ProductID

Integer

Product ID

Quantity

Numeric

Quantity

Dimension table Product:

ProductID

ProductName

Unit

Origin

ProductType

1

Apple

Pound

Shandong

Fruits

2

Tissue

Packs

Guangdong

Home&Personalcare

3

Beef

Pound

Qingdao

Meat

4

Wine

Bottles

Shanxi

Beverage

5

Pork

Pound

Xizang

Meat

6

Bread

Packs

Beijing

Bakery

7

Juice

Bottles

Xinjiang

Beverage

Fields of dimension table Product:

Field name

Data type

Description

ProductID

String

Product ID

ProductName

String

Product name

Unit

String

Unit

ProductTypeID

String

Product type

Origin

Numeric

Product origin

Computing task:

Find the sales quantity of each type of product of each place of origin within the specified time period.

One thing we should take into consideration is that the place of origin is not fixed, and get sales quantity of a type of product according to the place of origin in the current transaction.

Techniques involved:

Employing time key on the dimension table. You can find more about SPL time key mechanism in Why Are There So Many Snapshot Tables in BI Systems?.

1. Generate a dimension table with time key according to the dimension table in the production system at regular time interval.

Add a time key field to Product.btx. Here’s the dimension table’s new structure:

Field name

Data type

Description

eTime

Datetime

Effective time for the current record

ProductID

String

Product ID

ProductName

String

Product name

Unit

String

Sales unit

ProductTypeID

String

Product type

Origin

Numeric

Place of origin

Perform daily data dump on Product.btx with a time key according to the updated information in the production system’s Product table every day.

2. Summarize data using the dimension table with a time key

Define a new composite primary key using eTime and ProductID for the dimension table while specifying eTime as the time key. Table T is associated with Product through Time and ProductID fields. Programmers do not need to take care of how to achieve the time key because SPL is designed to automatically handle the related computing logic. They just use the dimension table as a regular one.

Sample code

1. Dump dimension table Product.btx.

For original data: Directly add the time key field.


A

1

=connect("demo").query@x("select * from Product").derive(now():eTime)

2

= file("Product.btx").export@b(A1)

When data is updated: Import the new dimension table, get the record with the latest time under each type of product from the dumped btx file with the time key, compare them with the new table, and append a new record to the btx file for the updated data and record a new effective time.


A

1

=connect("demo").cursor@x("select * from Product")

2

= T("Product.btx").keys@it(ProductID,eTime)

3

=A1.select(cmp(~.array(),A2.find(A1.ProductID).array().to(2,))!=0)

4

= A3.fetch().derive(now():eTime)

5

=file("Product.btx").export@ab(A4)

A1 Import the new dimension table and set index on the primary key.

A2 Import the old dimension table Product from the btx file, get record having the latest time for each product, and set index on the primary key.

@t option means the last key field is the time key.

A3 Get records under each primary key value from the new dimension table where the latest (largest) eTime is different from that in the records under same key value in the old dimension table. The find function will automatically select the record having the latest time corresponding to same key value.

A4 Add effective time field to A3’s records.

A5 Append and export A4’s records to the bin file.

2. Join the fact table and the dimension table and perform aggregation

Suppose T.ctx is already generated as the above explains and sorted by Time:


A

1

>Product=T("Product.btx").keys@ti(ProductID,eTime)

2

>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")

3

=file("T.ctx").open().cursor(ProductID,Quantity,Time;Time>=start && Time<=end)

4

=A3.switch(ProductID:Time,Product)

5

=A4.groups(ProductID.ProductType,ProductID.Origin; sum(Quantity):Quantity)

A1 Import dimension table Product, and set index on the primary key while specifying the time key.

A4 Join the fact table and the dimension table using the familiar SPL join syntax. As the joining fields contains a time field, the operation will find the record having the latest time (the largest time that is smaller than eTime) before the current time field value.

Execution result:

ProductType

Origin

Quantity

Fruits

Shandong

1241628

Fruits

Xinjiang

546357

Fruits

Hainan

24526

Home&Personalcare

Guangdong

7411008

Meat

Qingdao

3303230

Meat

Neimeng

657546

Meat

Xizang

2456235

Bakery

Beijing

247673

Beverage

Xinjiang

3526574

Beverage

Shanxi

6090112