Performance optimization skill: Numbered Foreign Key

I. Problem introduction & solving

  Loading dimension tables in the memory and pre-joining them has effectively increased performance when the memory isn’t large enough to hold all data files (See Performance Optimization Tricks: Pre-Joining of Dimensions). But a faster computation is always what we are after. So I’d like to introduce a special trick for optimizing dimension association to the fact table that traditionally requires hash computation and comparison.

  The idea is this: if a dimension table’s primary key values are natural numbers starting from 1 – which are equivalent to row numbers, we can locate a record in the dimension table through the key values, i.e. according to the row numbers, rather than by computing and comparing hash values. This speeds up the association of a dimension table to the fact table and eventually makes the whole computation faster. The number-based location makes it unnecessary to create indexes on dimensions and thus reduces memory consumption.

  In the following part, I’ll illustrate how SPL implements our old query in in the old test environment(explained in the previous article) by numbering foreign key values and let’s see how much of the increase in performance with this trick.

 

II. The pre-computation warm-up

  To number foreign key values requires that a dimension table’s primary key values be the numbers (row numbers). Yet very few of the cases in real-world businesses meet that requirement. So first we need to convert the primary key values into numbers:

  1Create a list of key value – number pairs where the correspondences between each key value and a natural number are stored;
  2Replace the dimension table’s primary key values by natural numbers to get a new dimension table;
  3Convert the fact table’s foreign key values into corresponding numbers in the dimension table according to the key value – number list to get a new fact table;

  The three dimension tables and the fact table we are going to use to test the trick are supplier, part, orders and lineitem.

 

1. Numbering supplier table

  The SPL script for replacing the table’s key values by natural numbers (The SPL script in the following sections):


A

1

=file("/home/tbl/supplier.tbl").cursor(;  , "|").new(_1:S_SUPPKEY,   _2:S_NAME, _3:S_ADDRESS, _4:S_NATIONKEY,    _5:S_PHONE, _6:S_ACCTBAL).fetch().derive(#:xh)

2

=file("/home/btx/supplier_xh_key.btx").export@b(A1,S_SUPPKEY,xh)

3

=A1.new(xh:S_SUPPKEY,  S_NAME, S_ADDRESS, S_NATIONKEY, S_PHONE,   S_ACCTBAL)

4

=file("/home/ctx/supplier_xh.ctx").create(#S_SUPPKEY,  S_NAME, S_ADDRESS, S_NATIONKEY, S_PHONE,   S_ACCTBAL)

5

>A4.append(A3.cursor())

  A1: Read in data from the source file supplier.tbl and add a new column xh using derive() function whose values are row numbers.

  A2: Output A1’s S_SUPPKEY field and xh field to bin file supplier_xh_key.btx to generate a list of key value – number pairs for numbering the fact table.

  A3: Replace S_SUPPKEY field values by xh field values to restructure the supplier table sequence.

  A4: Create composite table supplier_xh.ctx to hold the numbered supplier table.

  A5: Output data in the restructured supplier table sequence to the composite table.

 

2. Numbering part table

  The SPL script:


A

1

=file("/home/tbl/part.tbl").cursor(;  , "|").new(_1:P_PARTKEY,   _2:P_NAME, _3:P_MFGR, _4:P_BRAND,    _5:P_TYPE, _6:P_SIZE, _7:P_CONTAINER, _8:P_RETAILPRICE).fetch().derive(#:xh)

2

=file("/home/btx/part_xh_key.btx").export@b(A1,P_PARTKEY,xh)

3

=A1.new(xh:P_PARTKEY,  P_NAME,P_MFGR, P_BRAND, P_TYPE, P_SIZE,   P_CONTAINER, P_RETAILPRICE)

4

=file("/home/ctx/part_xh.ctx").create(  #P_PARTKEY, P_NAME,P_MFGR, P_BRAND, P_TYPE,    P_SIZE, P_CONTAINER, P_RETAILPRICE)

5

>A4.append(A3.cursor())

  The script is phrased in the same way as the previous one. The list of key value – number pairs is part_xh_key.btx and the numbered part file is stored as composite table part_xh.ctx.

 

3. Numbering orders table

  The SPL script:


A

1

=file("/home/tbl/orders.tbl").cursor(;  , "|").new(_1:O_ORDERKEY,   _2:O_CUSTKEY,    _3:O_ORDERSTATUS,_4:O_TOTALPRICE,    _5:O_ORDERDATE,  _6:O_ORDERPRIORITY,_7:O_SHIPPRIORITY).fetch().derive(#:xh)

2

=file("/home/btx/orders_xh_key.btx").export@b(A1,O_ORDERKEY,xh)

3

=A1.new(xh:O_ORDERKEY,O_CUSTKEY,O_ORDERSTATUS,O_TOTALPRICE,   O_ORDERDATE,O_ORDERPRIORITY,O_SHIPPRIORITY)

4

=file("/home/ctx/orders_xh.ctx").create(#O_ORDERKEY,O_CUSTKEY,O_ORDERSTATUS,O_TOTALPRICE,   O_ORDERDATE,O_ORDERPRIORITY,O_SHIPPRIORITY)

5

>A4.append(A3.cursor())

  The script is phrased in the same way as the previous one. The list of key value – number pairs is orders_xh_key.btx and the numbered part file is stored as composite table orders_xh.ctx.

 

4. Numbering lineitem table

  The SPL script:


A

1

=file("/home/tbl/lineitem.tbl").cursor(;  , "|").new(_1:L_ORDERKEY,  _4:L_LINENUMBER, _2:L_PARTKEY, _3:L_SUPPKEY,  _5:L_QUANTITY,   _6:L_EXTENDEDPRICE,_7:L_DISCOUNT, _8:L_TAX, _9:L_RETURNFLAG,  _10:L_LINESTATUS,_11:L_SHIPDATE,  _12:L_COMMITDATE,   _13:L_RECEIPTDATE,_14:L_SHIPINSTRUCT, _15:L_SHIPMODE, _16:L_COMMENT)

2

=file("/home/btx/orders_xh_key.btx").import@b()

3

=file("/home/btx/part_xh_key.btx").import@b()

4

=file("/home/btx/supplier_xh_key.btx").import@b()

5

=A1.switch(L_ORDERKEY,A2:O_ORDERKEY;L_PARTKEY,A3:P_PARTKEY;L_SUPPKEY,A4:S_SUPPKEY)

6

=A5.run(L_ORDERKEY=L_ORDERKEY.xh,  L_PARTKEY=L_PARTKEY.xh,   L_SUPPKEY=L_SUPPKEY.xh)

7

=file("/home/ctx/lineitem_xh.ctx").create(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY,  L_SUPPKEY, L_QUANTITY,  L_EXTENDEDPRICE,L_DISCOUNT, L_TAX,  L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE,   L_COMMITDATE, L_RECEIPTDATE,L_SHIPINSTRUCT, L_SHIPMODE,  L_COMMENT;L_ORDERKEY)

8

>A7.append(A6)

  A1: Create a cursor to read data from the source lineitem file.

  A2/A3/A4: Import the three dimension tables’ key value – number pairs lists respectively.

  A5: Create association of L_ORDERKEY to orders’s key value – number pairs list, L_PARTKEY to part’s correspondence list and L_SUPPKEY to supplier’s list.

  A6: Replace values of each of those keys by numbers in the corresponding list.

  A7: Create composite table lineitem_xh.ctx to hold data of the numbered lineitem file.

  A8: Append data of the numbered lineitem file cursor to the composite table.

 

III. Tests

1. Query without optimization

  The SPL script for preloading the dimension tables:


A

1

>env(supplier,   file("/home/ctx/supplier.ctx").create().memory().keys@i(S_SUPPKEY))

2

>env(part,   file("/home/ctx/part.ctx").create().memory().keys@i(P_PARTKEY))

3

>env(orders,file("/home/ctx/orders.ctx").create().memory().keys@i(O_ORDERKEY))

  It loads every dimension table into the memory and creates an index on its keys.

  SPL test script to implement the query:


A

1

=file("/home/ctx/lineitem.ctx").create().cursor(L_ORDERKEY,L_PARTKEY,    L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,L_SHIPDATE)

2

=A1.switch(L_ORDERKEY,orders;L_PARTKEY,part;L_SUPPKEY,supplier)

3

=A2.select(L_ORDERKEY.O_TOTALPRICE>0  && L_PARTKEY.P_SIZE>0 &&   L_SUPPKEY.S_ACCTBAL<999999)

4

=A3.groups(year(L_SHIPDATE):l_year;  sum(L_EXTENDEDPRICE * (1 -   L_DISCOUNT)):revenue)

  First execute the preloading script and then the test script. It takes 450 seconds to execute the test script.

 

2. Query with numbered foreign key

  The SPL script for pre-loading the dimension tables:


A

1

>env(supplier,   file("/home/ctx/supplier_xh.ctx").create().memory())

2

>env(part,   file("/home/ctx/part_xh.ctx").create().memory())

3

>env(orders,file("/home/ctx/orders_xh.ctx").create().memory())

  There’s no need to create indexes because the keys of dimension tables are already numbered.

  SPL test script to implement the query:


A

1

=file("/home/ctx/lineitem_xh.ctx").create().cursor(L_ORDERKEY,L_PARTKEY,    L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,L_SHIPDATE)

2

=A1.switch(L_ORDERKEY,orders:#;L_PARTKEY,part:#;L_SUPPKEY,supplier:#)

3

=A2.select(L_ORDERKEY.O_TOTALPRICE>0  && L_PARTKEY.P_SIZE>0 &&   L_SUPPKEY.S_ACCTBAL<999999)

4

=A3.groups(year(L_SHIPDATE):l_year;  sum(L_EXTENDEDPRICE * (1 -   L_DISCOUNT)):revenue)

  Note: The “:#” used in A2 enables an association of the fact table’s foreign key values to a dimension table’s row numbers. The foreign key value 7, for instance, will directly link to row 7 in the dimension table.

First execute the preloading script and then the test script. It takes 269 seconds to execute the test script.

 

3. Summary

  In the two tests, the numbers of records and fields in both the dimension tables and the fact table remain unchanged. The only change is that their key values are replaced by numbers. The two queries select same records for computation using same filtering condition - that is to say, they spend same time in data filtering and grouping. The difference lies in the ways the association is implemented. One is by row numbers while the other is through hash value comparison. By numbering foreign key values and locating corresponding records in a dimension table via row numbers, the query becomes 181 faster than before. It’s amazing.

 

IV. Extended explanations

  You can only number a dimension table’s primary key. There’s no restriction on the primary key values’ data type. It can be integer, string, date, time, etc. If a dimension table has a multi-field primary key, add another one or more number fields and create a match list where one number corresponds to multiple key values to number the fact table’s foreign key.

  It’s convenient to apply the trick to historical data queries. Numbering the historical data once is enough and there’s no need to retain the list of key value - number pairs.

  The trick is also suitable for querying newly-added data with only a few more steps.

1. When both the dimension table(s) and the fact table have new data

  1)Get new records from the dimension table(s) and append the new key values and their numbers to the match list;

  2) Append new records to the numbered dimension table according to the match list;

  3) Append new records to the numbered fact table according to the match list.

2. When only the fact table has new data

  Just perform step 3 to handle the new data.