Performance optimization skill: Association For Dimension Table Filtering & Computation

 

  Often a join query between the fact table and a dimension table involves filtering and computation over the dimension data, which can be optimized in two approaches:

  1. First join the fact table and the dimension table (pre-associate them if they can fit into the memory) and then filter the associated fact table just like the calculations in Performance Optimization Skill: Pre-Association and Performance Optimization Skill: Numberizing Foreign Key. 

  2. First perform filtering on the dimension table and then join the filtered table with the fact table. A join requires an index on the dimension tables key. But after filtering the index cannot be used anymore and a new one needs to be created. 

  Their effects are influenced by the proportion of the size of dimension table(s) to that of the fact table. So we need some tests to find which one is better in different contexts. 

 

I. Test environment

  Eight data tables, 50G data in total, have been generated according to TPCH standard. The structure of TPCH data file is vastly introduced online, which will not be elaborated here.     

  The server for test has two Intel2670 CPUs, 2.6G frequency, 16-core in total, 128G memory and an SSD hard disk.

  The following tests are calculated in single-thread without the help of multi-core to make the difference clearer.  

 

II. All-in-memory computation

   All-in-memory processing means to pre-load all the source data in memory. The data to be computed here are customer, the dimension table with 7.5 million records in total, and orders, the fact table with 75 million records.

  The filtering conditions on the dimension table are left(C_NAME,4)!="shen" && C_NATIONKEY>-1 && C_ACCTBAL>bal. The goal is to calculate the total orders amount meeting these conditions. The first two conditions are always true (to increase the computation amount for the purpose of emphasizing the differences in tests). Parameter bal is used to test the query speeds with different proportions of dimension data size and fact data size. 

 

1. Pre-association

  The SPL script with pre-joined dimension table and fact table:


A

1

>customer=file("/home/ctx/customer.ctx").open().memory().keys@i(C_CUSTKEY)

2

>orders=file("/home/ctx/orders.ctx").open().memory()

3

=orders.switch(O_CUSTKEY,customer)

4

=now()

5

=orders.select(left(O_CUSTKEY.C_NAME,4)!="shen" && O_CUSTKEY.C_NATIONKEY>-1 && O_CUSTKEY.C_ACCTBAL>bal)

6

=A5.sum(O_TOTALPRICE)

7

=interval@s(A4,now())

  A1 loads the dimension table in memory and creates an index on its keys. A2 loads the fact table in memory. A3 pre-associates them. But these previous steps are not counted in test time which only starts from the execution of A4.

 

2. Index recreation

  SPL script:


A

1

>customer=file("/home/ctx/customer.ctx").open().memory().keys@i(C_CUSTKEY)

2

>orders=file("/home/ctx/orders.ctx").open().memory()

3

=now()

4

=customer.select(left(C_NAME,4)!="shen" && C_NATIONKEY>-1 && C_ACCTBAL>bal).derive@o().keys@i(C_CUSTKEY)

5

=orders.switch@i(O_CUSTKEY,A4)

6

=A5.sum(O_TOTALPRICE)

7

=interval@s(A3,now())

  A4 filters customer table and then recreates an index on it. A5 joins the two tables.

 

3. Index reuse

  SPL supports the reuse of an existing index. So we just need to change A4 in the above script into:

    =customer.select@i(left(C_NAME,4)!="shen" && C_NATIONKEY>-1 && C_ACCTBAL>bal)

  Add @i option to select in order to reuse customer tables index.

 

4. Numberized foreign key 

  First load the numberized composite tables customer_xh.ctx and orders_xh.ctx when pre-loading data tables.

  SPL script:


A

1

>customer=file("/home/ctx/customer_xh.ctx").open().memory()

2

>orders=file("/home/ctx/orders_xh.ctx").open().memory()

3

=now()

4

=orders.switch@i(O_CUSTKEY,customer:#)

5

=A4.select(left(O_CUSTKEY.C_NAME,4)!="shen" && O_CUSTKEY.C_NATIONKEY>-1 && O_CUSTKEY.C_ACCTBAL>bal)

6

=A5.sum(O_TOTALPRICE)

7

=interval@s(A3,now())

  A numberization-based association doesnt need an index, so we dont create index in A1. customer:# in A4 is used to associate O_CUSTKEY values with the row numbers of customer table.

 

5. Alignment base sequence building

  First load the numberized composite tables customer_xh.ctx and orders_xh.ctx when pre-loading data tables.

  SPL script:


A

1

>customer=file("/home/ctx/customer_xh.ctx").open().memory()

2

>orders=file("/home/ctx/orders_xh.ctx").open().memory()

3

=now()

4

=customer.(left(C_NAME,4)!="shen" && C_NATIONKEY>-1 && C_ACCTBAL>bal)

5

=orders.select(A4(O_CUSTKEY))

6

=A5.sum(O_TOTALPRICE)

7

=interval@s(A3,now())

  In A4, use customer. (filtering condition) to calculate a sequence of the same length as the number of records with true or false as its values, which is called alignment base sequence. Since the numberized O_CUSTKEY field values in orders table corresponds to row numbers in the customer table in order, we can use A4(O_CUSTKEY) in A5 to check if the row in orders meets the filtering condition. 

 

6. Test results & analysis

  Here are results of the above test scripts (Unit: second):

Number of records of filtered dimension table

7.16 million

6.13 million

4.77 million

2.73 million

0.68 million

Pre-association

41

39

38

37

35

Index recreation

39

34

29

25

19

Index reuse

35

31

27

23

17

Numberized foreign key

53

51

49

48

46

Alignment base sequence building

25

23

21

19

16

  In these tests, the records in fact table (75 million) is ten times as many as those in dimension table (7.5 million).

Both pre-association and numberized foreign key tests do the association first and then the filtering. The complex filtering operation is performed over the fact table, which means that the computation amount is 10 times more than that of filtering the dimension table directly. So it takes the longest to execute the whole query. But the pre-association is faster than the numberized foreign key because it doesnt need to do the association any more during the query. 

  Both the index recreation and index reuse texts do the filtering over the dimension table first and then joining with the fact table. The complex filtering operation is executed on the rows of dimension table, so these two tests are faster than the above two mentioned methods. Though index recreation and index reuse require the same computation amount of filtering, association and sum, the latter is faster because it doesnt need to create an index. However, as the data size after dimension table filtering becomes smaller and small, it will take less time to recreate an index, resulting in narrower execution time differences.

  Alignment base sequence building test performs filtering first on the rows of dimension table and then on the fact table according to the built alignment base sequence without joining with the fact table, the index creation and hash values, therefore, it is the fastest among all the optimization skills.

 

III. In-memory dimension table & external storage fact table

  This time we make the 75-million-records orders as the dimension table and the 3-billion-record lineitem as the fact table.

The filtering conditions on the dimension table are left(O_ORDERPRIORITY,2)!="9-" && O_ORDERSTATUS!="A" && O_ORDERDATE>date("1990-01-01") && O_TOTALPRICE>price. The goal is to calculate the total orders amount meeting these conditions. The first three conditions are always true (to increase the computation amount for the purpose of emphasizing the differences in tests). Parameter price is used to test the query speeds with different proportions of dimension data size and fact data size. 

 

 

1. Association & filtering

  The SPL script:


A

1

>orders=file("/home/ctx/orders.ctx").open().memory().keys@i(O_ORDERKEY)

2

=now()

3

=file("/home/ctx/lineitem.ctx").open().cursor(L_ORDERKEY,L_EXTENDEDPRICE)

4

=A3.switch@i(L_ORDERKEY,orders)

5

=A4.select(left(L_ORDERKEY.O_ORDERPRIORITY,2)!="9-" && L_ORDERKEY.O_ORDERSTATUS!="A" && L_ORDERKEY.O_ORDERDATE>date("1990-01-01") && L_ORDERKEY.O_TOTALPRICE>price)

6

=A5.total(sum(L_EXTENDEDPRICE))

7

=interval@s(A2,now())

  A1 loads the dimension table in memory and creates an index on its key, which is not counted in the test time. The execution time starts from A2. Since the fact table size is huge, we retrieve data from it with a cursor and join it with the dimension table before performing the filtering.

 

2. Index recreation

  SPL script:


A

1

>orders=file("/home/ctx/orders.ctx").open().memory().keys@i(O_ORDERKEY)

2

=now()

3

=orders.select(left(O_ORDERPRIORITY,2)!="9-" && O_ORDERSTATUS!="A" && O_ORDERDATE>date("1990-01-01") && O_TOTALPRICE>price).derive@o().keys@i(O_ORDERKEY)

4

=file("/home/ctx/lineitem.ctx").open().cursor(L_ORDERKEY,L_EXTENDEDPRICE).switch@i(L_ORDERKEY,A3)

5

=A4.total(sum(L_EXTENDEDPRICE))

6

=interval@s(A2,now())

  The orders table are filtered and then an index is created on it in A3.

 

3. Index reuse

  We just change the code of A3 in the above script into:

    =orders.select@i(left(O_ORDERPRIORITY,2)!="9-" && O_ORDERSTATUS!="A" && O_ORDERDATE>date("1990-01-01")&& O_TOTALPRICE>price)

  The @i option is added to select in order to reuse orders tables index.

 

4. Numberized foreign key

  First load the numberized orders table, which is the composite table orders_xh.ctx, without creating index on it.

  SPL script:


A

1

>orders=file("/home/ctx/orders_xh.ctx").open().memory()

2

=now()

3

=file("/home/ctx/lineitem_xh.ctx").open().cursor(L_ORDERKEY,L_EXTENDEDPRICE)

4

=A3.switch@i(L_ORDERKEY,orders:#)

5

=A4.select(left(L_ORDERKEY.O_ORDERPRIORITY,2)!="9-" && L_ORDERKEY.O_ORDERSTATUS!="A" && L_ORDERKEY.O_ORDERDATE>date("1990-01-01") && L_ORDERKEY.O_TOTALPRICE>price)

6

=A5.total(sum(L_EXTENDEDPRICE))

7

=interval@s(A2,now())

  orders:# in A4 is used to associate L_ORDERKEY values with the row numbers of orders table.

 

5. Alignment base sequence building

  First load the numberized orders table, which is the composite table orders_xh.ctx, without creating index on it.

  SPL script:


A

1

>orders=file("/home/ctx/orders_xh.ctx").open().memory()

2

=now()

3

=orders.(left(O_ORDERPRIORITY,2)!="9-" && O_ORDERSTATUS!="A" && O_ORDERDATE>date("1990-01-01") && O_TOTALPRICE>price)

4

=file("/home/ctx/lineitem_xh.ctx").open().cursor(L_ORDERKEY,L_EXTENDEDPRICE).select(A3(L_ORDERKEY))

5

=A4.total(sum(L_EXTENDEDPRICE))

6

=interval@s(A2,now())

  The code explanation is the same as that of whole RAM computation. 

 

6. Test results & analysis

  Here are results of the above test scripts (Unit: second):

Number of records of filtered dimension table

64.43 million

49.95 million

35.90 million

22.49 million

4.28 million

Association & filtering

101

98

97

94

92

Index recreation

102

98

92

73

53

Index reuse

85

82

77

74

57

Numberized foreign key

79

78

76

75

72

Alignment base sequence building

53

49

47

43

39

  In these tests, the records in fact table (3 billion) is four times as many as those in dimension table (75 million).

The query analysis is the same as that for the previous section of tests, but the proportion of the fact table size and the dimension table size decreases from 10 times to 4 times. In this case the speed difference between the numberized foreign key and the index reuse becomes quite small. The numberized foreign key skill is even faster when the number of records filtered away from the dimension table is small because the numberization-based association is more efficient than hash values comparison.

 

IV. Summary

  Based on the previous analysis, we can come to a conclusion about which technique we should use to speed up dimension data filtering and computation, and here are the instructions:

1 When the fact table size is smaller than the dimension table size

  1) Perform pre-association if all data can fit into memory.

  2) If the data cant fit into memory but the dimension table key and the fact tables foreign key are already numberized, first join the fact table and the dimension table by numberized key values and then filter the fact table.

  3) If the data cant fit into memory and the dimension table key and the fact tables foreign key are not numberized, first join the fact table and the dimension table via foreign key and then filter the fact table.

2 When the fact table size is much larger than the dimension table size

  1) If the fact tables foreign key is already numberized, create an alignment base sequence according to which the fact table is corresponded.

  2) If the fact tables foreign key isnt numberized, first filter the dimension table and reuse the old index, then perform association by the foreign key.

  3 When the fact table size is slightly larger than the dimension table size

  1) If the fact tables foreign key is already numberized, create an alignment base sequence according to which the fact table is corresponded.

  2)If the fact tables foreign key isnt numberized, wed better do a test to confirm which skill is faster: pre-association (if the data can fit into memory) or index reuse.