Performance Optimization - 6.7 [Foreign key association] Big dimension table search
When traversing the fact table and using the foreign key to search the dimensional table records, only one record will be taken at a time; However, the fact table is usually not ordered by the foreign key fields (the fact table may have multiple foreign keys. Being ordered by one foreign key will not be ordered by another. In most cases, the fact table is not ordered by any foreign key), and dimension table records are also randomly found; If the number of records in the fact table is slightly large (not too large to be loaded into memory), reading the dimension table is a typical frequent, random and small-amount-data access, which will lead to an extremely low performance to the hard disk.
Fortunately, the stability of dimension table is relatively high, it is usually not very large and can be loaded into memory. This is the case discussed in previous sections.
Occasionally, however, we will encounter a situation where the dimension table is too large to be loaded into memory. For the dimension table in the external storage, we can no longer use the above algorithms, otherwise the hard disk will face the frequent, random and small-amount-data access.
Let’s first consider the case where the fact table is relatively small and can be loaded into memory.
Associating the dimension table with the foreign key is essentially a search action. For the big dimension table, it is the search in external storage. Reviewing the algorithms introduced earlier, we will find that in order to obtain high performance, the data table needs to be stored orderly by the to-be-searched key key (the primary key of dimension table), or build an index.
If there are multiple records in the fact table, there will be multiple foreign keys, the problem becomes the batch searches in external storage. In this case, the foreign keys of the fact table need to be gathered for searching to avoid frequent access caused by separate search.
The dimension table stored in the bin file is ordered by the primary key, and the dimension table needs to be searched randomly, so it cannot appear as a cursor. After the foreign key values of the fact table are concatenated into a sequence, using the iselect()function (binary search) can relatively quickly fetch the required dimension table records, avoiding the traversal of whole big dimension table. The foreign key of fact table may have duplicate values, we need to use the id() to do DISTINCT operation and sort at the same time, and then search them in the dimension table file, and finally associate them with the fact table.
SPL encapsulates this process into a joinx@q() function:
Unlike the addressization method used earlier, in this code, only part of records and fields are taken out from the dimension table, and such records and fields are for temporary use only in most situations, in this case, it doesn’t make much sense to do addressization after constructing a table sequence, hence the joinx@q() function directly joins the dimension table fields to the fact table records.
If the dimension table is stored in the composite table, we can also use the index to search, which can obtain better performance.
Generally, the composite table is sorted by the primary key, and even if there is no index, it can also be found more quickly by the binary search.
Similar to in-memory dimension table, this method can also resolve the association of multiple big dimension tables at the same time. After resolving the dimension tables in external storage, we can continue to resolve the in-memory dimension table on the obtained table sequence. Clarifying the relationship between dimension table and fact table is the key foundation for solving the foreign key relationship and obtaining high performance.
The relational algebra theory does not distinguish between dimension table and fact table. For the join of two tables, many databases still read the small table into memory first and traverse the large table. For the current situation, it will first read the fact table into memory and then traverse the dimension tables. For these two situations, i.e., large fact table and small dimension table, big dimension table and small fact table, the database processing method is the same.
The dimension table search method we are discussing here also needs to read the fact table, but it does not need to traverse the big dimension table, and the processing method for the above two situations is not the same. In this method, every record of the fact table needs to participate in the calculation, and traversal of the fact table is unavoidable. For the dimension table, however, not every record needs to participate in the calculation (the maximum number of records used for calculation does not exceed that of fact table). It does not take much time to read the small dimension table, and it will not have much impact even if there is waste due to read the whole dimension table (moreover, as discussed in the previous sections, these dimension tables are usually reused, hence there is no waste). However, it is not worth traversing the big dimension table. When the fact table is relatively small, only a small part of dimension table records will be used, in this case, it is not necessary to traverse the whole big dimension table.
Database, or relational algebra system, has a relatively simple understanding of data association, and does not deeply reflect the more essential characteristics of data association, hence it is impossible to design a higher performance algorithm theoretically, and we have to hope for the engineering optimization of the database. Specifically, some well-designed database optimizers can find that one of the association keys is the primary key of the big table, and will “intelligently” choose the search technology instead of traversal, but they will still get “confused” when many tables involved.