Performance Optimization - 6.1 [Foreign key association] Foreign key addressization
Let’s briefly review the concept of foreign key association: when the field F in the data table T is associated with the primary key K of the data table D, D is called the foreign key table of T, also known as the dimension table, T is called the fact table, and field F is called the foreign key field. If the primary key of D has multiple fields, the foreign key association can also be defined in a similar way. The goal of foreign key association is to be able to reference the fields of the record in D corresponding to field F when accessing the records of T.
The characteristic of foreign key association is that the foreign key of fact table must be associated with the primary key of dimension table. Since the primary key is unique, the records of fact table will only be associated with the unique record of dimension table, which is a typical many-to-one association. This characteristic should be taken advantage of when optimizing the performance.
If the amount of data is not very large, both the fact table and the dimension table can be loaded into memory. We can build the association relationship first, and then we can refer to the fields of dimension table quickly.
Suppose that the foreign key p_id in the orders table is associated with the primary key pid of the product table (the foreign key field and the corresponding primary key field are often named the same in the actual design for data table structure, and the small difference is made intentionally here to distinguish), we want to take the product unit price from product table and multiply it by the sales quantity in the order table to obtain the order amount, and then count the total order amount of different manufacturers (also a field in the product table).
In this code, A1 reads the product table, A2 reads the order table, and A3 sets the primary key and establishes the index for the product table. The action of A4 is to convert the foreign key field p_id in A2 to the corresponding record in A1, that is, after the execution of A4, the field p_id in A2 will become a record in A1, which is equivalent to executing:
A3 builds an index to make the switch() run faster. Since the fact table is usually much larger than the dimension table, this index can be reused many times.
When A5 traverses the fact table, since the value of the p_id field is already a record, we can directly use the operator . to reference this field, in this case, both p_id.vendor and p_id.price can be executed normally.
After the switch() operation is performed, the content stored in the p_id field of the fact table A2 in the memory is already the address of a record in the dimension table A1. This action is called foreign key addressization. At this time, when the operator . is used to reference the dimension table fields, we can take out them directly without the need to use the foreign key value to search in A2, which is equivalent to be able to take out the dimension table fields in constant time. Otherwise, even if the fast hash method is used to search the index, it still takes some time, and it’s also related to the size of the dimension table.
This is equivalent to reusing the results of the third step switch(). For the first three steps, that is, reading the table, building the index and using switch() to make association, all are one-off in advance. After that, the operation for the fact table can quickly reference the dimension table fields.
The reason for being able to do so exactly takes advantage of the uniqueness of foreign key association towards the dimension table mentioned above, which means, one foreign key field value will uniquely correspond to one dimension table record, and each p_id can be converted to its uniquely corresponding record in A2.
In fact, the code writing becomes simpler and easier to understand after the foreign key addressization.
This operation is actually the join in the database. The relational algebra does not stipulate that one party of the association must be the primary key, and the uniqueness of either party in the association cannot be guaranteed, nor can the foreign key field be converted to dimension table records in advance. The database usually uses the hash algorithm to perform in-memory joining, that is, calculate the hash value of the associated fields of both tables, put the records with the same hash value into a group, and then traverse the correspondence. The complexity of performing a join operation is equivalent to building an index on the association key of the first table (there is no clear definition for dimension table and fact table when no primary key is assumed), and then traversing the second table so as to use the association key of its each record to search in the first table, the computation amount of this process equals to that performed in A3 and A4 in this example.
The preparations for foreign key addressization will be the same thing (like A3 and A4 do). Once the preparations finished, the result can be reused without the need to build an index and search based on this index each time the join operation is performed. In this way, the performance of subsequent calculations will be much better. If only one operation needs to be performed, plus the time to do the preparations, it will not be faster than the hash algorithm for database.
We can do the foreign key addressization following the read of data table when the system starts, this action is called pre-association. After that, we can deal with the association query and aggregation at a high speed. This method is very meaningful for building a high-performance in-memory database, while it cannot be achieved based on relational algebra theory system.
If the fact table has multiple foreign keys associated with different dimension tables, and dimension table also have dimension tables, the pre-association can be performed in advance for both cases:
In this code, A1, A2 and A3 read data, A4 and A5 perform the pre-correlation, and A6 and A7 are the operation code.
This code will select the orders with the same transaction place and production place, and then count the total order amount by manufacturer. Three table associations are involved in this code, among which the order table has two foreign keys and dimension table, and the region table, as the dimension table, is associated with other tables twice.
In principle, the hash join algorithm can only parse one association relationship at a time. If there are multiple association relationships, it needs to parse many times and traverse related tables many times as well. After distinguishing the fact table from dimension table, we can perform foreign key addressization for multiple foreign keys of the same fact table in one traversal. From the perspective of theoretical analysis, although the comparison times of the hash algorithm that needs multiple traversals and separate parsing are not greater than one traversal that needs to perform multiple foreign key addressizations, each traversal in hash algorithm is accompanied by the generation and copying of a batch of data records, as a result, the actual computation amount of hash algorithm is much greater than that of one traversal.
Moreover, when there are many association layers and data tables involved, the optimizer of the database often cannot find the most appropriate parsing order, and different order will result in a different data generation and amount of copy operation. If each association involves large table, these unnecessary calculations will be very large. We will observe that when the amount of data does not increase much, and only the number of associated tables and layers increases, the computing performance of database may decrease sharply.
After the fact table and dimension table are clearly distinguished, the multi-layer association of multiple tables will also be clear. The addressization for the association between small tables can be performed separately without involving other large tables. In this case, the degree of performance decrease is only linearly related to the data amount and the number of associated layers.
In practice, the association relationship is often much more complex than that in the example here, and the advantage of foreign key addressization will be more obvious.
Using switch() to achieve foreign key addressization can obtain a better computing performance, but it still has a few problems:
1) After the addressization, the value of foreign key field itself is lost, which can be obtained only by taking the primary key from the dimension table, thus, the speed will slow down.
In the example shown above, if we want to get the pid after the addressization, we need to use p_id.pid.
2) If no associated records are found in dimension table, the foreign key field will be converted to null, and the original value will be completely lost.
3)The foreign key may be composed of multiple fields, in this case, we cannot use the switch() function to convert.
SPL also provides the join() function to adapt to these scenarios:
The join()will add a field in the original table sequence to save the address of dimension table record, and the original field value will not change, in this way, all the above problems can be solved. The difference is that join() will return a new table sequence, thus attention should be given to the order when performing the addressization. For the data table that is both a fact table and a dimension table, it needs to associate its own dimension table first to obtain a new data table, and then be associated with the fact table.