Performance Optimization - 6.2 [Foreign key association] Instant addressization

 

Address is an in-memory concept. The foreign key addressization can only be implemented in all-in-memory operation, yet the big data often needs an external storage to perform computing.

Let’s first consider the situation where the fact table is large and the dimension table is small, which is also a common situation in reality. The fact table is used to store ever-growing events and will easily become very large, while the dimension table is used to store the code information, and will change little in information growth, be relatively stable in scale, and won’t become too large.

If only the dimension table is in memory, the foreign key addressization for the fact table cannot be performed in advance, which means the pre-association cannot be achieved. In this case, the association can only be made temporarily.

A
1 =file(“product.btx”).import@b().keys@i(id)
2 =file(“orders.btx”).cursor@b(p_id,quantity)
3 =A2.switch(p_id,A1)
4 =A3.groups(p_id.vendor;sum(p_id.price*quantity))

The foreign key addressization can be implemented by using the switch()function on the cursor. It will return a delayed cursor, and the substantive association calculation will occur only when fetching the data. Unlike the all-in-memory operation where the operation is performed based on the original data table because the switch() will change the foreign key field of original data table, the operation in A4 can only be performed based on the result returned by A3, and the current association is made during the cursor reading.

This method is called instant addressization. In this method, only the index established on the dimension table can be reused, and the action of searching the records of dimension table has to be done every time the operation is performed; thus, the performance will be much worse than that of all-in-memory operation.

Theoretically, the instant addressization algorithm will still have more advantages than the hash join algorithm. For the large table that cannot be stored in memory, if the hash join algorithm is used, the records in the table need to be split, according to the hash value, into several parts that can be stored in memory, this process is commonly known as partitioning. After that, the in-memory hash joining between the parts corresponding to the hash value will be performed. Partitioning will cause the external storage to generate additional writing and reading actions resulting in a decrease in performance, while the instant addressization will not result in the generation of the buffer data in external storage, and there are no unnecessary write and read actions.

However, when the database encounters a join operation of two tables, one large and one small, it will not strictly implement the hashing & partitioning operation. Usually, it will read the small table into memory first, and then read the large table data in batches to perform in-memory joining. As a result, the actual complexity and performance are not much different from the above code.

In theory, the relational algebra system cannot distinguish between dimension table and fact table but the size of the tables. In case of two tables, the database optimizer can easily design a correct execution path according to the size of table, but when there are many tables and the association is complex, the optimizer may be “confused” and cannot design a reasonable execution path, instead, it will execute the original hashing & partitioning algorithm. Therefore, we will observe that even if there is only one large table, there may be a sharp decrease in performance when there are many associated tables.

After conceptually distinguishing the dimension table from the fact table, it is clearly that we should first read the dimension table into memory, build the index and make pre-association between the dimension tables, and then traverse the fact table to perform the calculation. Compared with all-in-memory operation, this algorithm has no difference in overall structure, except that the fact table needs to be read with a cursor and the association can only be made with dimension table temporarily, and buffer files will not be generated. Moreover, the pre-association between the index on the dimension table and the dimension table can still be reused.

A
1 =file(“area.btx”).import@b().keys(aid)
2 =file(“product.btx”).import@b()
3 =A2.join(a_id,A1,~:area).keys@i(pid)
4 =file(“orders.btx”).cursor@b(p_id,a_id,quantity)
5 =A4.join(p_id,A3,:product;a_id,A1,:area)
6 =A5.select(product.area.state==area.state)
7 =A6.groups(product.vendor;sum(product.price*quantity))

The pre-processing actions performed in A1, A2, and A3 can be executed when the system is started. The indexes in A1 and A3 can also be reused when associating the dimension table in A5.

Both join()and switch() functions can perform the parallel computing based on multi-cursor. In this example, A4 can be written as a multi-cursor with the cursor@m() option, and the subsequent code does not need to be changed. The in-memory join operation in the previous section can also be converted to the in-memory multi-cursor to perform parallel computing.