User Behavior Analysis in Practice 9: Enumerated Dimension and Tag Dimension

 

Target task

We have a user events table T. Below is its structure and part of its data:

Time

UserID

EventType

OS

Browser

f1

f2

f3

f4

f5

2022/6/1 10:20

1072755

Search

Android

IE

true

false

false

true

false

2022/6/1 12:12

1078030

Browse

IOS

Safari

false

false

true

true

true

2022/6/1 12:36

1005093

Submit

Android

Chrome

true

true

true

false

false

2022/6/1 13:21

1048655

Login

Windows

Chrome

false

false

true

true

true

2022/6/1 14:46

1037824

Logout

Android

Edge

false

false

false

true

true

2022/6/1 15:19

1049626

AddtoCart

Windows

Edge

true

true

false

true

false

2022/6/1 16:00

1009296

Submit

IOS

Firefox

false

true

false

false

true

2022/6/1 16:39

1070713

Browse

IOS

Sogou

true

true

true

false

false

2022/6/1 17:40

1090884

Search

Windows

IE

true

false

true

true

false

Fields in table T:

Field name

Data type

Description

Time

Datetime

Time stamp of an event, accurate to milliseconds

UserID

String

User ID

EventType

String

Event type

OS

String

Operating system, whose value is Android, IOS, Windows, …… or Unknown

Browser

String

Browser, whose value is IE, Safari, Edge, Firefox, Chrome, Sogou, …… or Unknown

String

Other fields that have enumerated values

f1

Boolean

Whether it is an offsite event or not; value is true or false

f2

Boolean

Whether it is a usual device or not; value is true or false

f3

Boolean

Whether it is a usual browser or not; value is true or false

f4

Boolean

Whether it is a cell phone or not; value is true or false

f5

Boolean

Whether it is the first operation; value is true or false

Boolean

Other fields that have boolean values

Computing task:

Find the number of occurrences of each type of event performed by users who are not newcomers on a local Android or IOS system using Safari, Edge or Chrome within a specified time period, and count the distinct users under same conditions.

Techniques involved:

Find more about enumerated dimensions and tag dimensions in What is the Key to Make the Multi-tag User Profile Analysis Run Faster? and Multidimensional Analysis Backend Practice 7: Boolean Dimension and Binary Dimension.

1. Convert enumerated dimension values to ordinal numbers

According to the value lists of OS and Browser fields, convert values of corresponding fields in events table T into their ordinal numbers in the lists.

2. Convert binary dimension values into bits

Put every 16 binary fields (f1,…) whose values are true/false in events table T into one group, and store 16 values as bits in an integer field. An integer field stores 16 values of the original fields. Note: SPL optimizes the 16-bit integer and can achieve a fast retrieval. Generally, a 16-bit integer is enough. If there are more than 16 binary dimensions, use more integer fields to store them.

After two conversions, events table T now has the following structure:

Field name

Data type

Description

Time

Datetime

Time stamp of an event, accurate to milliseconds

UserID

String

User ID

EventType

String

Event type

OS

Integer

Operating system, whose values are ordinal numbers in the enumerated sequence

Browser

Integer

Browser, whose values are ordinal numbers in the enumerated sequence

b1

Integer

An integer field that stores binary fields using bits

b2

Integer

An integer field that stores binary fields using bits

3. Summarize data on the converted events table T.

Use SPL ordinal-number-based reference and bitwise operations to perform aggregations.

Sample code

1. Basic method of dumping events table T:


A

10

=connect("demo").cursor@x("select * from T order by Time")

11

=file("OS.txt").import@i()

12

=file("Browser.txt").import@i()

13

=A10.new(Time,UserID,EventType,A11.pos(OS):OS,A12.pos(Browser):Browser,……,bits@n(f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13,f14,f15,f16):b1,bits@n(f17,…..,f32):b2,……)

14

=file("T.ctx").create(#Time,UserID,EventType,OS,Browser,……,b1,b2,……)

15

=A14.append(A13)

16

>A14.close()

A10 Import the fact table and sort it by Time.

A11 Import the list of all possible values for OS field.

undefined

A12 Import the list of all possible values for Browser field.

undefined

Replace OS field values with the corresponding ordinal numbers in A11, and transform Browser field values to the corresponding ordinal numbers in A12. If the value list is ordered, we can use @b option in pos function and use binary search to locate the targe ordinal number more quickly.

Store every 16 binary fields whose values are true or false into an integer field by bit, 1 represents true and 0 represents false. bits@n means converting true/false into 1/0.

A14 Create composite table file structure.

A15 Append records of A13 to A14’s composite table file.

2. For any number of enumerated dimensions

When there are a lot of enumerated fields, we can name the value list file after the corresponding field name so that expression can be generated automatically. For example:


A

6

[OS,Browser]

7

=A6.(file(~/".txt").import@i())

8

=A6.("A7("/#/").pos("/~/"):"/~).concat@c()

A6 Define a sequence of enumerated field names.

A7 Import lists of enumerated field values and return the following result:

undefined

A8 Generate an expression that converts enumerated values into corresponding ordinal number in the enumerated sequence. The result is as follows:

A7(1).pos(OS):OS,A7(2).pos(Browser):Browser

Insert A8’s expression into A13’s expression in step 1 in the form of a macro replacement symbol ${A8}, and the code becomes this:

=A10.new(Time,UserID,EventType, ${A8}, bits@n(f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13,f14,f15,f16):b1, bits@n(f17,…..,f32):b2,……)

3. For any number of binary dimensions

When there are many binary fields and their names are similar, such as fn, we can automatically generate a bits expression according to them:


A

1

>n=50

2

=n\16

3

=A2.((a=(#-1)*16,"bits@n("+16.("f"/(~+a)).concat@c()+"):b"/#))

4

="bits@n("+to(A2*16+1,n).("f"/~).concat@c()+"):b"/(A2+1)

5

=(A3|A4).concat@c()

A1 The number of binary fields.

A2 Find how many integers are needed to store the binary fields.

A3 Automatically piece together a bits@n expression.

A4 Take the rest of the fields whose number is less than 16 as the last integer field and generate expression for it.

Here’s A5’s result:

bits@n(f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13,f14,f15,f16):b1,bits@n(f17,f18,f19,f20,f21,f22,f23,f24,f25,f26,f27,f28,f29,f30,f31,f32):b2,bits@n(f33,f34,f35,f36,f37,f38,f39,f40,f41,f42,f43,f44,f45,f46,f47,f48):b3,bits@n(f49,f50):b4

Insert A5’s expression into A13’s expression in step 1 in the form of a macro replacement symbol ${A5}, and the code becomes this:

=A10.new(Time,UserID,EventType, ${A8}, ${A5})

4. The complete code for performing data dump:


A

1

>n=50

2

=n\16

3

=A2.((a=(#-1)*16,"bits@n("+16.("f"/(~+a)).concat@c()+"):b"/#))

4

="bits@n("+to(A2*16+1,n).("f"/~).concat@c()+"):b"/(A2+1)

5

=(A3|A4).concat@c()

6

[OS,Browser]

7

=A6.(file(~/".txt").import@i())

8

=A6.("A7("/#/").pos("/~/"):"/~).concat@c()

9

=connect("demo").cursor@x("select * from T order by Time")

10

=A9.new(Time,UserID,EventType, ${A8}, ${A5})

11

=file("T.ctx").create(#Time,UserID,EventType,OS,Browser,……,b1,b2,……)

12

=A11.append(A10)

13

>A11.close()

5. Basic method of summarizing data on the converted fact table

The filtering condition that “users who are not newcomers use Safari/Edge/Chrome on a local Android/IOS system” corresponds to f1 field whose value is 0, f5 field whose value is 0, OS field whose values are Android/IOS and Browser field whose values are Safari/Edge/Chrome.


A

12

=bits(0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0)

13

=bits(1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0)

14

=file("OS.txt").import@i()

15

=A14.(["Android","IOS"].contain@b(~))

16

=file("Browser.txt").import@i()

17

=A16.(["Chrome","Edge","Safari"].contain@b(~))

18

>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")

19

=file("T.ctx").open().cursor(UserID, EventType;Time>=start && Time<=end && A15(OS) && A17(Browser) && and(b1,A13)==A12).groups(EventType; count(1):Num, icount(UserID):iNum)

A12 Set the 4th bit as 1 and all the other bits as 0 according to the filtering condition.

A13 Set the 1st, the 4th and the 5th bits that meet the condition as 1 and all the other bits as 0 according to the filtering condition.

A14 Import the sequence of all possible values for OS field.

A15 According to the filtering condition, generate a sequence of same length as A14’s sequence and having members whose values are whether they meet the filtering condition.

A16 Import the sequence of all possible values for Browser field.

A17 According to the filtering condition, generate a sequence of same length as A16’s sequence and having members whose values are whether they meet the filtering condition.

Filter data in table T’s cursor and perform grouping & aggregation. The bitwise operation and(b1,A13)==A12 in A19 means a filtering condition that the 4th bit is 1 and the 1st and the 5th bit are 0.

Execution result:

EventType

Num

iNum

AddtoCart

945674

85476

Browse

1778901

178791

Login

2033000

193400

Logout

2033000

193400

Search

1547931

127539

Submit

467345

43375

6. Summarization on any number of enumerated dimensions

SPL can automatically generate filtering expressions when there are a lot of enumerated dimensions to be filtered:


A

12

="{OS:[Android,IOS],Browser:[Chrome,Edge,Safari]}"

13

=json(A12)

14

=A13.fname()

15

=A14.(file(~/".txt").import@i().(A13.field(A14.~).contain(~)))

16

=A14.("A15("/#/")("/~/")")

17

=A16.concat("&&")

A12 Define a JSON string made up of names of to-be-filtered enumerated fields and filtering values.

A13 Convert A12’s JSON string into a table sequence.

A14 Get the sequence of field names from A13.

A15 Generate sequences of boolean values to which corresponding enumerated fields are aligned. When the sequence of enumerated values passed from A12 is ordered, we can use @b option with contain function to speed up the computation.

undefined

A16 Generate a sequence of filtering expressions.

undefined

A17 Generate a string of filtering expressions: A15(1)(OS) && A15(2)(Browser)

Insert A17’s expression into A19’s expression in step 5 in the form of a macro replacement symbol ${A17}, and the code becomes this:

=file("T.ctx").open().cursor(UserID, EventType;Time>=start && Time<=end && ${A17} && and(b1,A13)==A12).groups(EventType; count(1):Num, icount(UserID):iNum)

7. Filtering summarization on any number of binary dimension fields

SPL can automatically generate filtering expressions when there are a lot of binary dimensions to be filtered:


A

1

[1,5,22]

2

[4,20]

3

50

4

=[0]*A3

5

>A4(A1)=1

6

=[0]*A3

7

>A6(A1|A2)=1

8

=A4.group((#-1)\16)

9

=A6.group((#-1)\16)

10

=A8.len().("and(A9("/~/"),b"/~/")==A8("/~/")")

11

=A10.concat("&&")

A1 Define a sequence of ordinal numbers of binary fields whose values are required to be true according to the filtering condition.

A2 Define a sequence of ordinal numbers of binary fields whose values are required to be false according to the filtering condition.

A3 The total number of binary fields.

A4 Generate a sequence whose members are 0 and having same length as A3.

A5 Assign 1 to A4’s members whose corresponding values are required to be true.

A6 Generate a sequence whose members are 0 and having same length as A3.

A7 Assign 1 to A6’s members whose corresponding values need to be filtered.

A8 Group A4 by putting every 16 members together.

undefined

A9 Group A6 by putting every 16 members together.

undefined

A10 Generate a sequence of filtering expressions.

A11 Concatenate filtering expression into a string.

and(A9(1),b1)==A8(1) && and(A9(2),b2)==A8(2) && and(A9(3),b3)==A8(3) && and(A9(4),b4)==A8(4)

Insert A11’s expression into A19’s expression in step 5 in the form of a macro replacement symbol ${A11}, and the code becomes this:

=file("T.ctx").open().cursor(UserID, EventType;Time>=start && Time<=end && ${A17} && ${A11}).groups(EventType; count(1):Num, icount(UserID):iNum)

8. The complete code for performing the computing task


A

1

[1,5,22]

2

[4,20]

3

50

4

=[0]*A3

5

>A4(A1)=1

6

=[0]*A3

7

>A6(A1|A2)=1

8

=A4.group((#-1)\16)

9

=A6.group((#-1)\16)

10

=A8.len().("and(A9("/~/"),b"/~/")==A8("/~/")")

11

=A10.concat("&&")

12

="{OS:[Android,IOS],Browser:[Chrome,Edge,Safari]}"

13

=json(A12)

14

=A13.fname()

15

=A14.(file(~/".txt").import@i().(A13.field(A14.~).contain(~)))

16

=A14.("A15("/#/")("/~/")")

17

=A16.concat("&&")

18

>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")

19

=file("T.ctx").open().cursor(UserID, EventType;Time>=start && Time<=end && ${A17} && ${A11}).groups(EventType; count(1):Num, icount(UserID):iNum)