Performance Optimization - 6.8 [Foreign key association] One side partitioning


Finally, let’s deal with the situation where both the dimension table and the fact table are very large, usually the fact table will be larger. For this situation, it is very difficult to achieve a high-speed calculation in any way, but we still try to find a way to calculate as quickly as possible.

Is it possible to read the fact table with a cursor, and then execute in batches the dimension table search algorithm as introduced in the previous section?

No, because the fact table is very large, too many batches of search will always find all dimension table records, or even more than once. This algorithm will lead to serious frequent and small-amount-data reads. As mentioned earlier, the total time-consuming of index sorting is not necessarily shorter than that of big sorting algorithm, hence there is a high probability that the performance of this algorithm will not be as good as the algorithm that sorts both the fact table and the dimension table followed by merging and joining.

For the situation where both tables are large, the database generally adopts the hashing & partitioning method which works in a way that first calculate the hash value of the association key in two tables respectively, and then divide the data whose hash value is within a certain range into a part so as to form the buffer data in external storage, and ensure that each part is small enough to be loaded into memory, and finally execute the in-memory join algorithm for each pair of part (two tables) one by one. This method will split and buffer both large tables, which can also be called two-side partitioning method. When we are unlucky with the hash function, a second round of hashing may be required since a certain part may be too large.

If the dimension table is stored orderly by the primary key and can be read in segments after adopting an appropriate storage scheme, thereby we can think that the dimension table has been logically partitioned (we can simply regard each segment as a part, and can also easily calculate an appropriate number of segments after knowing the total size of the data table). At this time, we only need to split and buffer the fact table by the segment of dimension table’s primary key into which fact table’s foreign key falls, i.e., partitioning the fact table, and then gradually associate each part (of the fact table) with the corresponding segment of the dimension table, because the partitioning action has ensured that the fact table records in this part will only be related to the dimension table records in the corresponding segment.

The logical process of the algorithm is as follows:

1 =file(“customer.btx”)
2 =10.(A1.cursor@b(id;~:10).fetch(1).cid)
3 =file(“orders.btx”).cursor@b(c_id,amount)
4 =10.(file(“temp”/~))
5 =A3.groupn(pseg(A2,c_id);A4)
6 func for 10 =A1.curor@b(cid,area,discount;B6:10).fetch()
7 =A4(B6).cursor@b().join(c_id,C6:cid,area,discount)
8 for C7,1000 return C8
9 =cursor@c(A6).groups(area;amount*discount)

The overall process is similar to the two-side partitioning algorithm. However, this algorithm does not need to partition the dimension table, only needs to partition the fact table, hence it can be called one-side partitioning. Moreover, this algorithm can divide the dimension table into equal segments, and there won’t be a situation where we need to do a second round of hashing and partitioning caused by encountering an unlucky hash function. Probably, there may be a too large fact table part, in this case, the algorithm for the large fact table and the small dimension table can be employed, and no secondary partitioning is required. The amount of buffer data from one-side partitioning algorithm is much less than that of two-side partitioning algorithm, and the performance will be better.

The process is still more complex, and SPL also encapsulates the algorithm:

1 =file(“customer.btx”)
2 =file(“orders.btx”).cursor@b(c_id,amount)
3 =A2.joinx@u(c_id,A1:cid,area,discount)
4 =A3.groups(area;amount*discount)

This code uses the joinx() function without @q option, SPL will think that the fact table is also large and needs to be partitioned. Similar to the algorithm in the previous section, the dimension table needs to be accessed in segments, and appear as a data file rather than a cursor.

The order of data fetched from the cursor returned by joinx@u() looks unordered. It is known from the above algorithm that these data are joined and returned after being fetched one by one according to the segments of the dimension table, and the overall order should be the same as the primary key of dimension table. However, the data in each segment are fetched by the order of the fact table parts, and each segment is not necessarily ordered by the primary key of dimension table, hence it looks like unordered on the whole.

If we want to return the data by the original order of fact table, just remove the @u option.

At this time, SPL will generate a sequence number recorded in the original fact table, and write the data after they are sorted by this sequence number during partitioning and buffering, and then write it once again into the buffer file after the joining at each segment is done, and finally merge and sort all the buffered data generated in the latter round by the original sequence number. In this way, it is equivalent to doing one more big sorting than joinx@u(), and the performance will be worse. But sometimes the fact table is originally ordered, and this order needs to be used for the next round of calculation, so it must continue to maintain this order.