Performance optimization skill: For Dimension Table Filtering & Computation

  Often a join query between the fact table and a dimension table involves filtering and computation over dimension data. There are two optimization methods.

  1. First join the fact table and the dimension table (pre-join them or only the dimension tables if they can fit into the memory) and then filtering the joined fact table (See Performance Optimization Tricks: Numbered Foreign Key).

  2. First perform filtering over the dimension table and then join the filtered table with the fact table. A join requires an index on the dimension table’s key. But after filtering the index is unusable and a new one needs to be created.

  Their effects are influenced by the proportion of the size of dimension table(s) to that of the fact table. So we need some tests to find which one is better in different contexts.

 

I. Test environment

  Eight data tables, a total of 50G data size, have been generated according to TPCH benchmark.

  There are two test computers. Each has an Intel2670 16-core CPU, a 2.6 GHz processor, a 128GB RAM and an SSD.

  For the convenience of the test, I adopt the single-threaded processing and just leave the multicore ability alone.

 

II. Whole RAM computation

  This requires all source data files be loaded into the memory in advance. The data files to be computed here are customer, the dimension table, which contains 7.5 million records, and orders, the fact table, which 75 million records.

  The filtering conditions on the dimension table are “left(C_NAME,4)!="shen" && C_NATIONKEY>-1 && C_ACCTBAL>bal”. The goal is to calculate the total orders amount meeting these conditions. To emphasize the differences between tests by increasing computation amount, let’s say the first two conditions should always be true. Parameter bal is used to test the query speeds with different proportions of dimension data size and fact data size.

 

1. Pre-joining

  The SPL script with pre-joined dimension table and fact table:


A

1

>customer=file("/home/ctx/customer.ctx").create().memory().keys@i(C_CUSTKEY)

2

>orders=file("/home/ctx/orders.ctx").create().memory()

3

=orders.switch(O_CUSTKEY,customer)

4

=now()

5

=orders.select(left(O_CUSTKEY.C_NAME,4)!="shen"  && O_CUSTKEY.C_NATIONKEY>-1   && O_CUSTKEY.C_ACCTBAL>bal)

6

=A5.sum(O_TOTALPRICE)

7

=interval@s(A4,now())

  A1 loads the dimension table into the memory and creates an index on its keys. A2 reads the fact table into the memory. A3 pre-joins them. The time counting starts from the execution of A4.

 

2. Index rebuild

  SPL script:


A

1

>customer=file("/home/ctx/customer.ctx").create().memory().keys@i(C_CUSTKEY)

2

>orders=file("/home/ctx/orders.ctx").create().memory()

3

=now()

4

=customer.select(left(C_NAME,4)!="shen"  && C_NATIONKEY>-1 &&   C_ACCTBAL>bal).derive@o().keys@i(C_CUSTKEY)

5

=orders.switch@i(O_CUSTKEY,A4)

6

=A5.sum(O_TOTALPRICE)

7

=interval@s(A3,now())

  A4 filters customer table and then rebuilds an index for it. A5 joins the two tables.

 

3. Index reuse

  SPL supports the reuse of an existing index. So we just need to change A4 in the above script into:

    =customer.select@i(left(C_NAME,4)!="shen" && C_NATIONKEY>-1 && C_ACCTBAL>bal)

  select@i will reuse customer table’s index.

 

4. Numbered foreign key

  First load the numbered customer table and orders table, which are two composite tables - customer_xh.ctx and orders_xh.ctx.

  SPL script:


A

1

>customer=file("/home/ctx/customer_xh.ctx").create().memory()

2

>orders=file("/home/ctx/orders_xh.ctx").create().memory()

3

=now()

4

=orders.switch@i(O_CUSTKEY,customer:#)

5

=A4.select(left(O_CUSTKEY.C_NAME,4)!="shen"  && O_CUSTKEY.C_NATIONKEY>-1   && O_CUSTKEY.C_ACCTBAL>bal)

6

=A5.sum(O_TOTALPRICE)

7

=interval@s(A3,now())

  A number-based join doesn’t need an index, so A1 doesn’t build the index. A4 uses customer:# to represent an association of O_CUSTKEY values to customer table’s row numbers.

 

5. Alignment base sequence build

  First load the numbered customer table and orders table, which are two composite tables - customer_xh.ctx and orders_xh.ctx.

  SPL script:


A

1

>customer=file("/home/ctx/customer_xh.ctx").create().memory()

2

>orders=file("/home/ctx/orders_xh.ctx").create().memory()

3

=now()

4

=customer.(left(C_NAME,4)!="shen"  && C_NATIONKEY>-1 &&   C_ACCTBAL>bal)

5

=orders.select(A4(O_CUSTKEY))

6

=A5.sum(O_TOTALPRICE)

7

=interval@s(A3,now())

  In A4, customer(filtering condition) gets a sequence of same length as the table’s record count and having true or false as its values. This is what I call alignment base sequence. Since orders’ numbered O_CUSTKEY field values match customer’s row numbers in order, we use A4(O_CUSTKEY) in A5 to check if a row in orders meets the filtering condition.

 

6. Test results & analysis

  Here are results of the above test scripts (Unit: second):

Record count after dimension table   filtering

7.16 million

6.13 million

4.77 million

2.73 million

0.68 million

Pre-joining

41

39

38

37

35

Index rebuild

39

34

29

25

19

Index reuse

35

31

27

23

17

Numbered foreign key

53

51

49

48

46

Alignment base sequence build

25

23

21

19

16

  In our tests, the record count in fact table (75 million) is ten times as much as that in dimension table (7.5 million). Both pre-joining and alignment base sequence build first handle the join and then the filtering. The complex filtering operation is performed over the fact table, which means that the computation amount is 10 times more than that of filtering the dimension table. So it takes long to execute the query. But the pre-joining is faster than the alignment base sequence build because it doesn’t need to handle the join any more.

  Both the index rebuild and index reuse handle the filtering over the dimension table and then its join with the fact table. The filtering operation is executed over dimension table rows and so they are faster than the other two methods. Though they require same computation amount for filtering, join and sum, the index reuse is faster because it doesn’t need to build an index. But as the data size after dimension filtering becomes smaller and it takes less time to rebuild an index, their execution time gap becomes smaller, too.

  Alignment base sequence build trick performs filtering first over the dimension table rows and then on the fact table according to the alignment base sequence without handling the join, the index and hash values. It is the fastest among all optimization tricks.

 

III. In-memory dimension table & external memory fact table

  This time we make the 75-million-records orders the dimension table and the 3-billion-record lineitem the fact table. The filtering conditions on the dimension table are “left(O_ORDERPRIORITY,2)!="9-" && O_ORDERSTATUS!="A" && O_ORDERDATE>date("1990-01-01") && O_TOTALPRICE>price”. The goal is to calculate the total orders amount meeting these conditions. To emphasize the differences between tests by increasing computation amount, let’s say the first three conditions should always be true. Parameter price is used to test the query speeds with different proportions of dimension data size and fact data size.

 

1. Join & filtering

  The SPL script:


A

1

>orders=file("/home/ctx/orders.ctx").create().memory().keys@i(O_ORDERKEY)

2

=now()

3

=file("/home/ctx/lineitem.ctx").create().cursor(L_ORDERKEY,L_EXTENDEDPRICE)

4

=A3.switch@i(L_ORDERKEY,orders)

5

=A4.select(left(L_ORDERKEY.O_ORDERPRIORITY,2)!="9-"  &&   L_ORDERKEY.O_ORDERSTATUS!="A" &&    L_ORDERKEY.O_ORDERDATE>date("1990-01-01") &&  L_ORDERKEY.O_TOTALPRICE>price)

6

=A5.total(sum(L_EXTENDEDPRICE))

7

=interval@s(A2,now())

  A1 loads the dimension table into the memory and creates an index on its keys. The time counting starts from the execution of A2. Since the fact table size is huge, we retrieve data from it through a cursor and join it with the dimension table before performing a filtering.

 

2. Index rebuild

  SPL script:


A

1

>orders=file("/home/ctx/orders.ctx").create().memory().keys@i(O_ORDERKEY)

2

=now()

3

=orders.select(left(O_ORDERPRIORITY,2)!="9-"  && O_ORDERSTATUS!="A"   &&  O_ORDERDATE>date("1990-01-01")   && O_TOTALPRICE>price).derive@o().keys@i(O_ORDERKEY)

4

=file("/home/ctx/lineitem.ctx").create().cursor(L_ORDERKEY,L_EXTENDEDPRICE).switch@i(L_ORDERKEY,A3)

5

=A4.total(sum(L_EXTENDEDPRICE))

6

=interval@s(A2,now())

  A3 filters orders table and then rebuild an index for it.

 

3. Index reuse

  Just change A3’s code in the above script into:

    =orders.select@i(left(O_ORDERPRIORITY,2)!="9-" && O_ORDERSTATUS!="A" && O_ORDERDATE>date("1990-01-01") && O_TOTALPRICE>price)

  select@i will reuse orders table’s index.

 

4. Numbered foreign key

  First load the numbered orders table, which is the composite table - orders_xh.ctx, without creating index for it.

  SPL script:


A

1

>orders=file("/home/ctx/orders_xh.ctx").create().memory()

2

=now()

3

=file("/home/ctx/lineitem_xh.ctx").create().cursor(L_ORDERKEY,L_EXTENDEDPRICE)

4

=A3.switch@i(L_ORDERKEY,orders:#)

5

=A4.select(left(L_ORDERKEY.O_ORDERPRIORITY,2)!="9-"  &&   L_ORDERKEY.O_ORDERSTATUS!="A" &&   L_ORDERKEY.O_ORDERDATE>date("1990-01-01")  &&   L_ORDERKEY.O_TOTALPRICE>price)

6

=A5.total(sum(L_EXTENDEDPRICE))

7

=interval@s(A2,now())

  A4 uses orders:# to represent an association of L_ORDERKEY values to orders table’s row numbers.

 

5. Alignment base sequence build

  First load the numbered orders table, which is the composite table - orders_xh.ctx, without creating index for it.

  SPL script:


A

1

>orders=file("/home/ctx/orders_xh.ctx").create().memory()

2

=now()

3

=orders.(left(O_ORDERPRIORITY,2)!="9-"  && O_ORDERSTATUS!="A"   &&    O_ORDERDATE>date("1990-01-01") &&   O_TOTALPRICE>price)

4

=file("/home/ctx/lineitem_xh.ctx").create().cursor(L_ORDERKEY,L_EXTENDEDPRICE).select(A3(L_ORDERKEY))

5

=A4.total(sum(L_EXTENDEDPRICE))

6

=interval@s(A2,now())

  Same code explanation as that with a whole RAM computation.

 

6. Test results & analysis

  Here are results of the above test scripts (Unit: second):

Record count after dimension table   filtering

64.43 million

49.95 million

35.90 million

22.49 million

4.28 million

Join & filtering

101

98

97

94

92

Index rebuild

102

98

92

73

53

Index reuse

85

82

77

74

57

Numbered foreign key

79

78

76

75

72

Alignment base sequence build

53

49

47

43

39

  In this set of tests, the record count in fact table (3 billion) is four times as much as that in dimension table (75 million). The query analysis is the same as that for the first set of tests. Difference is that the ratio of the proportion of the fact table size and the dimension table size decreases to 4:1. In this case the time gap between numbered foreign key trick and the index reuse trick becomes small. The numbered foreign key trick is even faster when the number of records filtered away from the dimension table is small because the number-based association is more efficient than hash values comparison.

 

IV. Summary

  About which trick we use to speed up dimension data filtering and computation, there are the following instructions:

1 When the fact table size is smaller than the dimension table size

  1) Use a pre-join if all data files can fit into the memory.

  2) If the data files can’t fit into the memory but the dimension table key and the fact table’s foreign key are already numbered, join the fact table and the dimension table by numbers and then filter the fact table.

  3) If the data files can’t fit into the memory and the dimension table key and the fact table’s foreign key are not numbered, join the fact table and the dimension table via foreign key and then filter the fact table.

2 When the fact table size is much larger than the dimension table size

  1) If the fact table’s foreign key is already numbered, create an alignment base sequence according to which the fact table is matched.

  2) If the fact table’s foreign key isn’t numbered, first filter the dimension table and reuse the old index and then perform a join by the foreign key.

  3 When the fact table size is slightly larger than the dimension table size

  1) If the fact table’s foreign key is already numbered, create an alignment base sequence according to which the fact table is matched.

  2)If the fact table’s foreign key isn’t numbered, run a test to confirm which trick is faster – a pre-joining (if the data files can fit into the memory) or an index reuse.