Performance Optimization - 6.3 [Foreign key association] Foreign key sequence-numberization

 

Performance Optimization - 6.2 [Foreign key association] Instant addressization

Associating the dimension table with the foreign key of fact table is essentially a search action. However, the pre-association cannot be done when the fact table is larger. If there is a way to improve the search speed, a good association performance can also be obtained.

Usually, the hash index is already established on the dimension table. After reviewing the in-memory search technology discussed in Chapter 1, we can find that the sequence number positioning algorithm is the only method faster than hash search.

If the dimension table’s primary key value itself is a sequence number, it can be used directly, but this is not the case in most situations. Therefore, we need to convert the foreign key value of fact table to a sequence number, which is what we called foreign key sequence-numberization.

A
1 =file(“product.btx”).import@b().keys@i(pid)
2 =file(“orders.btx”).cursor@b()
3 =A2.run(p_id=A1.pfind(p_id))
4 >file(“orders_new.btx”).export@b(A3)

Using pfind() can find out the sequence number of search value in the table sequence, and replace the foreign key value in the original table with this sequence number, after that, we can use faster sequence number positioning algorithm to achieve the association involving external storage.

A
1 =file(“product.btx”).import@b()
2 =file(“orders_new.btx”).cursor@b(p_id,quantity)
3 =A2.switch(p_id,A1:#)
4 =A3.groups(p_id.vendor;sum(p_id.price*quantity))

The parameter # allows the switch()function to use the sequence number positioning, in this case, it is unnecessary to generate the index of dimension table (which can save some memory), and the subsequent calculation code could remain unchanged.

The association performance of foreign key sequence-numberization is almost the same as that of addressization. In this way, the same high performance as in-memory pre-association can be obtained for data table in external storage. The only difference is that the fact table is too large and needs to be read from the external storage.

However, the sequence-numberization needs to prepared in advance, that is, change the foreign key value of fact table. Moreover, since the conversion result depends on the order of records in dimension table, the fact table needs to be regenerated if the dimension table changes such as insertion or deletion, as a result, the maintenance and management cost is relatively high.

The sequence-numberization can also be used in the join() function to adapt to different scenarios where there is the multi-field foreign key, or the foreign key value needs to be retained. Examples for such scenarios will not be given here.

The primary key of some dimension tables is not a sequence number, but it can be associated with a certain sequence number by simple processing. Similar to the sequence number positioning search described in Chapter 1, sometimes we need to use align() to sort the dimension table by sequence number.

For example, in the above example, pid is not a continuous sequence number, since we know that it is always within 1-1000, we can first process the dimension table records to a record sequence sorted by sequence number:

A
1 =file(“product.btx”).import@b().align(1000,pid)
2 =file(“orders.btx”).cursor@b(p_id,quantity)
3 =A2.switch(p_id,A1:#)
4 =A3.groups(p_id.vendor;sum(p_id.price*quantity))

In this case, there is no need to regenerate the fact table, and the original table is available directly.

In another case, the dimension table’s primary key itself is not a sequence number, but the sequence number can be obtained through simple operation.

For example, pid is a string, the first character is a capital letter, and the last two are numbers. In this case, we can use

(asc(left(pid,1))-asc(“A”))*100+int(right(pid,2))+1

to convert pid to a natural number between 1-2700, which can associate with the sequence number. So, is it possible to avoid regenerating the fact table at this time?

It depends on the complexity of conversion calculation. If it is a simple addition or subtraction calculation, it is not a big problem. However, if the conversion can be achieved only by an expression like the one above, it still needs to regenerate the fact table, otherwise, we have to do such complex operation on the foreign key fields of fact table every time we search the records in dimension table, it will be a waste of time. As a result, it is probably not as good as hash index. Therefore, in order to improve performance, we need to do such calculation in advance.

A
1 =file(“orders.btx”).cursor@b()
2 =A1.run(p_id=(asc(left(p_id,1))-asc(“A”))*100+int(right(p_id,2))+1)
3 >file(“orders_new.btx”).export@b(A2)

This conversion has a slight advantage over the previous method, which can be achieved based only on the data of fact table and does not depend on the order of records in dimension table. As long as the dimension table maintains this rule, there is no need to regenerate a new fact table after the insertion or deletion occurs.

A
1 =file(“product.btx”).import@b()
2 =A1.align(2700,(asc(left(pid,1))-asc(“A”))*100+int(right(pid,2))+1)
3 =file(“orders_new.btx”).cursor@b(p_id,quantity)
4 =A3.switch(p_id,A2:#)
5 =A4.groups(p_id.vendor;sum(p_id.price*quantity))

The primary key of dimension table does not need to be converted, which can be done in the data preparation stage after each load.

We also discussed the multi-layer sequence number positioning technology in Chapter 1. The primary key of some dimension tables cannot be used when they are directly associated with sequence number, in this case, using the multi-layer sequence number makes it possible to use such primary keys (take the aforementioned ID card number as an example, if the ID number is directly converted to a sequence number, it will occupy too much memory space, but if the data distribution is appropriate, the ID number can be converted to multi-layer sequence number). In this way, we can establish the multi-layer sequence number index for the dimension tables and then use the foreign key to make association (let’s review again: the foreign key association is essentially a search action, and any index works). Likewise, however, since the operation of converting the primary key to multi-layer sequence number is usually not too simple, a good association performance can be obtained only after the foreign key is converted first, and it still needs to regenerate the fact table. The advantage of this conversion method over full sequence-numberization is only limited to what we just mentioned, that is, this conversion does not depend on the order of records in dimension table, and there is no need to re-convert when the insertion or deletion occurs to the dimension table.

The multi-layer sequence number is a sequence, which is inconvenient to store and read. In SPL, we can use the serial byte to process multi-layer sequence number as a single value.


Performance Optimization - 6.4 [Foreign key association] Inner join syntax
Performance Optimization - Preface