User Behavior Analysis in Practice 7: Dimension Table Filtering

 

Target task

We have a user events table T. Below is its structure and part of its data:

Time

UserID

ProductID

Quantity

2022/6/1 10:20

1072755

1

7

2022/6/1 12:12

1078030

2

8

2022/6/1 12:36

1005093

3

3

2022/6/1 13:21

1048655

4

9

2022/6/1 14:46

1037824

5

5

2022/6/1 15:19

1049626

6

4

2022/6/1 16:00

1009296

7

6

2022/6/1 16:39

1070713

8

7

2022/6/1 17:40

1090884

9

4

Fields in table T:

Field name

Data type

Description

Time

Datetime

Time stamp of an event, accurate to milliseconds

UserID

String

User ID

ProductID

Integer

Product ID

Quantity

Numeric

Quantity

Dimension table Product:

ProductID

ProductName

Unit

Price

ProductTypeID

1

Apple

Pound

5.5

1

2

Tissue

Packs

16

2

3

Beef

Pound

35

3

4

Wine

Bottles

120

4

5

Pork

Pound

25

3

6

Bread

Packs

10

5

7

Juice

Bottles

6

4

Fields of dimension table Product:

Field name

Data type

Description

ProductID

String

Product ID

ProductName

String

Product name

Unit

String

Sales unit

Price

Numeric

Unit price

ProductTypeID

Integer

Product type ID

Computing task:

Find the number of sold pieces for each product under type IDs 1, 2 and 3 within a specified time period.

Techniques involved:

1. Filtering dimension table before join

We first filter the dimension table, and then join it with the fact table. Thereisnoneedtojudgethespecifiedconditionondimensiontableforfacttablerecordsthatdonotmatchthefiltereddimensiontable. In this way the number of comparisons will decrease considerably.

2. Performing join in cursor

Join the fact table’s cursor with the filtered dimension table, during which fact table records that cannot associate with the dimension table will not be generated. This further reduces time spent in generating records in the joining result set.

3. Index reuse

Use the existing index on the filtered dimension table. When the filtering result set is still large, which means only a small number of records are filtered away, the technique saves your time for re-creating the index.

4. Alignment sequence

We can use an efficient alignment sequence to filter a numberized dimension table by generating a same-length table sequence, whose member values are whether the corresponding dimension records meet the filtering condition. When joining the dimension table with the fact table, we can directly find whether a fact table record matches or not according to its ordinal number. This saves time for performing the join and comparisons.

Sample code

Suppose T.ctx and Product.btx are already created as above introduces. T.ctx is ordered by Time and Product.btx is ordered by ProductID.

1. Filter dimension table Product and join it with T; delete records from table T that do not have matching records in the dimension table during joining.


A

1

>Product=file("Product.btx").import@b().select([1,2,3].pos(ProductTypeID)!=null).keys@i(ProductID)

2

>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")

3

=file("T.ctx").open().cursor(ProductID,Quantity;Time>=start && Time<=end)

4

=A3.switch@i(ProductID,Product:ProductID)

5

=A4.groups(ProductID; ProductID.ProductName, sum(Quantity):Quantity)

A1 Import the dimension table and filter it, and set index on the primary key.

A4 Join the fact table and the dimension table; @i option enables deleting records from the fact table that do not match the dimension table.

2. Move the association action in step 1 to the cursor generation statement, during which fact table records that cannot match won’t be generated in the cursor.


A

1

>Product=file("Product.btx").import@b().select([1,2,3].pos(ProductTypeID)!=null).keys@i(ProductID)

2

>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")

3

=file("T.ctx").open().cursor(ProductID,Quantity;Time>=start && Time<=end,ProductID:Product)

4

=A3.groups(ProductID; ProductID.ProductName, sum(Quantity):Quantity)

A3 Move the action of association between the fact table and the dimension table to the cursor generation statement; fact table records that are not matched won’t be generated in the result set.

3. Use the existing index on filtered dimension table Product


A

1

>Product=file("Product.btx").import@b().keys@i(ProductID)

2

=Product.select@i([1,2,3].pos(ProductTypeID)!=null)

3

>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")

4

=file("T.ctx").open().cursor(ProductID,Quantity;Time>=start && Time<=end,ProductID:A2)

5

=A4.groups(ProductID; ProductID.ProductName, sum(Quantity):Quantity)

A1 Import dimension table Product and set index on primary key.

A2 @i option enables reusing the existing index on the filtered dimension table.

Index reuse does not always make the computation faster. As records that are filtered away need to be deleted from the index table, it takes some time to perform the delete action when there are a lot of records are discarded (the number of desired records is small). In this case re-creating the index may be faster. It is important that we choose an appropriate method according to the actual situation.

4. Use alignment sequence as primary key values of dimension table Product are ordinal numbers


A

1

>Product=file("Product.btx").import@b()

2

Product.([1,2,3].pos(ProductTypeID))=

3

>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")

4

=file("T.ctx").open().cursor(ProductID,Quantity;Time>=start && Time<=end)

5

=A4.select(A2(ProductID))

6

=A5.groups(ProductID; Product(ProductID).ProductName, sum(Quantity):Quantity)

A1 Import dimension table Product without setting index on the primary key.

A2 Generate an alignment sequence according to Product table. Each of its values is whether the current record meets the filtering condition or not.

A5 Directly locate the target member in A2 according to table T’s ProductID to find out whether the current record satisfies the filtering condition.

5. Move the alignment-sequence-based filtering action to the cursor generation statement


A

1

>Product=file("Product.btx").import@b()

2

Product.([1,2,3].pos(ProductTypeID))=

3

>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")

4

=file("T.ctx").open().cursor(ProductID,Quantity;Time>=start && Time<=end && A2(ProductID))

5

=A4.groups(ProductID; Product(ProductID).ProductName, sum(Quantity):Quantity)

A4 Move the alignment-sequence-based filtering action to the cursor generation statement. Records that do not meet the filtering condition won’t be generated.

Execution result:

ProductID

ProductName

Quantity

1

Apple

206938

2

Tissue

463188

3

Beef

94378

5

Pork

217504