User Behavior Analysis in Practice 12: Using Pseudo Tables

 

Target task

We have a user events table T. Below is its structure and part of its data:

Time

UserID

EventType

OS

Browser

ProductID

f1

f2

f3

f4

f5

2022/6/1 10:20

1072755

Search

Android

IE

100001

true

false

false

true

false

2022/6/1 12:12

1078030

Browse

IOS

Safari

100002

false

false

true

true

true

2022/6/1 12:36

1005093

Submit

Android

Chrome

100003

true

true

true

false

false

2022/6/1 13:21

1048655

Login

Windows

Chrome


false

false

true

true

true

2022/6/1 14:46

1037824

Logout

Android

Edge


false

false

false

true

true

2022/6/1 15:19

1049626

AddtoCart

Windows

Edge

100004

true

true

false

true

false

2022/6/1 16:00

1009296

Submit

IOS

Firefox

100005

false

true

false

false

true

2022/6/1 16:39

1070713

Browse

IOS

Sogou

100006

true

true

true

false

false

2022/6/1 17:40

1090884

Search

Windows

IE

100007

true

false

true

true

false

Fields in table T:

Field name

Data type

Description

Time

Datetime

Time stamp of an event, accurate to milliseconds

UserID

String

User ID

EventType

String

Event type, whose value is Login, Browse, Search, AddtoCart, Submit or Logout

OS

String

Operating system, whose value is Android, IOS, Windows or Unknown

Browser

String

Browser, whose value is IE, Safari, Edge, Firefox, Chrome, Sogou or Unknown

ProductID

String

Product ID, whose value is the ProductID field of dimension table Product

String

Other fields that have enumerated values

f1

Boolean

Whether it is an offsite event or not; value is true or false

f2

Boolean

Whether it is a usual device or not; value is true or false

f3

Boolean

Whether it is a usual browser or not; value is true or false

f4

Boolean

Whether it is a cell phone or not; value is true or false

f5

Boolean

Whether it is the first operation; value is true or false

Boolean

Other fields that have Boolean values

In the previous article User Behavior Analysis in Practice 9: Enumerated Dimension and Tag Dimension, we convert the enumerated field and the tag field into corresponding ordinal numbers and bits respectively. Below is the structure and part of data of converted user events composite table T.ctx:

Time

UserID

EventType

OS

Browser

ProductID

b1

2022/6/1 10:20

1072755

3

1

1

100001

36864

2022/6/1 12:12

1078030

2

2

2

100002

14336

2022/6/1 12:36

1005093

5

1

5

100003

57344

2022/6/1 13:21

1048655

1

3

5


14336

2022/6/1 14:46

1037824

6

1

3


6144

2022/6/1 15:19

1049626

4

3

3

100004

53248

2022/6/1 16:00

1009296

5

2

4

100005

18432

2022/6/1 16:39

1070713

2

2

6

100006

57344

2022/6/1 17:40

1090884

3

3

1

100007

45056

Fields in converted table T:

Field name

Data type

Description

Time

Datetime

Time stamp of an event, accurate to milliseconds

UserID

String

User ID

EventType

Integer

Event type, whose value is an ordinal number of the enumerated sequence

OS

Integer

Operating system, whose value is an ordinal number of the enumerated sequence

Browser

Integer

Browser, whose value is an ordinal number of the enumerated sequence

ProductID

String

Product ID, whose value is the ProductID field of dimension table Product

b1

Integer

Integer field that stores binary fields as bits; the first five bits correspond to whether it is offsite, whether it is the usual device, whether it is the usual browser, whether it is cell phone, and whether it is the first operation

Ordinal numbers in EventType field and their description:

1 Login

2 Browse

3 Search

4 AddtoCart

5 Submit

6 Logout

Ordinal numbers in OS field and their description:

1 Android

2 IOS

3 Windows

4 Unknown

Ordinal numbers in Browser field and their description:

1 IE

2 Safari

3 Edge

4 Firefox

5 Chrome

6 Sogou

7 Unknown

Dimension table Product:

ProductID

ProductName

Unit

Price

ProductType

100001

Apple

Pound

5.5

Fruits

100002

Tissue

Packs

16

Home&Personalcare

100003

Beef

Pound

35

Meat

100004

Wine

Bottles

120

Beverage

100005

Pork

Pound

25

Meat

100006

Bread

Packs

10

Bakery

100007

Juice

Bottles

6

Beverage

Fields in dimension table Product:

Field name

Data type

Description

ProductID

String

Product ID

ProductName

String

Product name

Unit

String

Sales unit

Price

Numeric

Unit price

ProductType

Integer

Product type

Relationship between T and Product:

undefined

Computing task:

Find the number of occurrences of each type of event performed by users who are not newcomers on a local Android or IOS system using Safari, Edge or Chrome under the product type Home & Personal care in each month within a specified time period, and count the distinct users under same conditions.

Techniques involved:

Learn more about SPL pseudo tables in SPL Pseudo Table Data Type Optimization.

We can use the pseudo table to predefine computed columns and foreign keys, and simplify the representations of enumerated dimensions and tag dimensions:

1. Define a computed column Month to calculate the month in the time when the current event occurs.

2. Define pseudo fields EventTypeName, OSName and BrowserName to store correspondence relationships between EventType, OS and Browser’s ordinal numbers and their names.

3. Define the foreign key association on ProductID field between it and the corresponding dimension table.

4. Define field name represented by each bit in b1.

After a pseudo table is defined, we can use it to achieve data dump and summarization. This can significantly reduce the amount of code for summarization.

Sample code

1. Define pseudo table


A

1

=T("Product.btx").keys@i(ProductID)

2

=file("T.ctx").create(#Time,UserID,EventType,OS,Browser,ProductID,……,b1,……)

3

=A2.close()

4

=[{file:"T.ctx",

column:[

{name:"Month",exp:"month@y(Time)"},

{name:"EventType",pseudo:"EventTypeName",enum:["Login","Browse","Search","AddtoCart","Submit","Logout"]},

{name:"OS",pseudo:"OSName",enum:["Android","IOS","Windows","Unknown"]},

{name:"Browser",pseudo:"BrowserName",enum:["IE","Safari","Edge","Firefox","Chrome","Sogou","Unknown"]},

{name:"b1",bits:["f1","f2","f3","f4","f5",…….]},

{name:"ProductID",dim:A1}]

}]

5

=pseudo(A4)

A1 Import dimension table Product, and set primary key and index for it.

A2 Create composite table structure after enumerated fields are converted into ordinal numbers and binary fields are transformed to bits, and write it to the empty file T.ctx (database data will later be directly appended to the file through the pseudo table without writing the conversion code for ordinal numbers and bits).

A3 Close A2’s composite table.

A4 Define a pseudo table based on composite table T.ctx:

{name:"Month",exp:"month@y(Time)"}: The code defines a computed column named Month and evaluated through month@y(Time).

{name:"EventType", pseudo:"EventTypeName", enum:[ "Login", "Browse", "Search", "AddtoCart", "Submit", "Logout"]}: The code defines correspondence relationship between EventType field values and the enumerated sequence, and represents the matching name through pseudo field EventTypeName.

{name:"OS",pseudo:"OSName",enum:["Android","IOS","Windows","Unknown"]}: The code defines correspondence relationship between OS field values and the enumerated sequence, and represents the matching name through pseudo field OSName.

{name:"Browser",pseudo:"BrowserName",enum:["IE","Safari","Edge","Firefox","Chrome","Sogou","Unknown"]}: The code defines correspondence relationship between Browser field values and the enumerated sequence, and represents the matching name through pseudo field BrowserName.

{name:"b1",bits:["f1","f2","f3","f4","f5",…….]}: The code defines field name each bit in the bit-based dimension b1 represents.

{name:"ProductID",dim:A1}: The code defines association relationship on ProductID field between it and dimension table A1.

A5 Generate the pseudo table.

The pseudo table definition string can be saved for direct use in later computations. This can further reduce the amount of code.

2. Use pseudo table to dump data. SPL will automatically convert data into corresponding ordinal numbers and bits according to pseudo table definition, and store the converted data in table T.


A

/ The above code for defining pseudo table

6

=connect("demo").cursor@x("select * from T order by Time")

7

=A5.append@i(A6)

A6 Connect to the database, and import data in table T to generate a cursor.

A7 Retrieve and append data in the cursor to the pseudo table.

3. Perform aggregation using the pseudo table


A

/ The above code for defining pseudo table

6

>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")

7

=A5.select(Time>=start && Time<=end && ProductID.ProductType=="Home&Personalcare"&& ["Safari","Edge","Chrome"].pos(BrowserName) && ["Android","IOS"].pos(OSName) && ! f1 && f4 && !f5).groups(Month,EventTypeName; count(1):Num, icount(UserID):iNum)

A7 Perform aggregation using the pseudo table. We can use pseudo fields defined for the pseudo table as ordinary fields and treat the pseudo table as an ordinary, simple table without taking care of storage and computing mechanisms.

Execution result:

Month

EventTypeName

Num

iNum

202203

AddtoCart

307603

29252

202203

Browse

596492

58140

202203

Login

672163

65569

202203

Logout

672163

65569

202203

Search

491317

42919

202203

Submit

144552

13901

202204

AddtoCart

615222

58484

202204

Browse

1192970

116265

202204

Login

1344323

131123

202204

Logout

1344323

131123

202204

Search

982637

85843

202204

Submit

289112

27799

202205

AddtoCart

615214

58484

202205

Browse

1192976

116262

202205

Login

1344339

131133

202205

Logout

1344339

131133

202205

Search

982633

85848

202205

Submit

289108

27788

202206

AddtoCart

307635

29256

202206

Browse

596463

58124

202206

Login

672175

65575

202206

Logout

672175

65575

202206

Search

491344

42929

202206

Submit

144573

13887