Improving JOIN Performance: The Extension of Foreign Key Pointerization

Let’s continue our discussion of foreign-key-oriented JOINs using the example in the previous article.

If the fact table is too big to fit in memory, the foreign key pointerization strategy fails, because pre-calculated memory pointers cannot be stored on external storage.

Generally, a dimension table referenced by a foreign key is small, while the ever-growing fact table is large. By loading the smaller dimension table into memory, we can still create temporary foreign key pointers.

1. P=file("products.txt").import() Import the products table P

2. P.index(id) Build an index on P's primary key, the id field, to speed up data search

3. S=file("sales.txt").cursor() Create cursor S on the sales table to retrieve data from it in a streaming style

4. S.switch(productid,P:id) Switch values of S's productid field to P's records by matching the latter's primary key while data is flowing in

5. S.sum(quantity*productid.price) Calculate the sales amount

The first two steps are the same as those of the in-memory computation. In step 4, pointers are created as data streams in, but they cannot be retained for reuse; the hash calculations and comparisons have to be repeated for the next JOIN, so performance is poorer than that of the in-memory strategy. In terms of computation amount, this algorithm beats the hash partitioning algorithm in that it does not need to calculate hash values on the dimension table's primary key, an advantage that grows when the hash index on the dimension table is reused repeatedly, though the overall gain is limited because a dimension table is generally small. More importantly, like the in-memory algorithm, it handles all foreign key joins in a single traversal and is easy to parallelize, so its real-world performance is much better than that of the hash partitioning algorithm.
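To make the mechanics concrete, here is a minimal Python sketch of the same flow (the numbered steps above are in the article's own scripting syntax; the tab-separated file layout and field names below are illustrative assumptions). The small dimension table is held in memory behind a hash index, and each fact record streaming in is resolved against it on the fly:

```python
import csv

# Steps 1-2: load the small dimension table and hash-index it by primary key.
with open("products.txt", newline="") as f:
    products = {row["id"]: row for row in csv.DictReader(f, delimiter="\t")}

# Steps 3-5: stream the fact table; each lookup turns the foreign key into a
# temporary in-memory reference (the "pointer"), then the amount is summed.
total = 0.0
with open("sales.txt", newline="") as f:
    for sale in csv.DictReader(f, delimiter="\t"):
        product = products[sale["productid"]]  # hash search on every record
        total += float(sale["quantity"]) * float(product["price"])

print(total)
```

The dict lookup plays the role of the hash index; because the resolved references live only for the duration of the traversal, the next JOIN has to repeat the hash searches, just as noted above.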

This algorithm has a variation: foreign key numberization.

By converting the dimension table's primary key values into natural numbers starting from 1, we can identify a dimension table record directly by its sequence number, without calculating and comparing hash values, and obtain the high performance that the in-memory foreign key pointerization strategy gives.

1. P=file("products.txt").import() Import the products table P, where values of the primary key, the id field, are natural numbers starting from 1

2. S=file("sales.txt").cursor() Create cursor S on the sales table to retrieve data from it in a streaming style

3. S.switch(productid,P:#) Switch values of S's productid field to P's records by matching the latter's sequence numbers while data is flowing in

4. S.sum(quantity*productid.price) Calculate the sales amount

With a dimension table whose primary key values are numberized, the hash index creation (step 2 of the previous algorithm) becomes unnecessary.
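The payoff is easy to see in a Python sketch under the same illustrative file layout: when the primary key values are natural numbers starting from 1, the dimension table is just an array, and a foreign key value locates its record by position with no hash calculation or comparison at all:

```python
import csv

# Step 1: load the dimension table; the record whose id is i sits at
# position i-1, so no hash index is built.
with open("products.txt", newline="") as f:
    products = list(csv.DictReader(f, delimiter="\t"))

# Steps 2-4: stream the fact table and aggregate; the foreign key value is
# itself the position of the matching dimension record.
total = 0.0
with open("sales.txt", newline="") as f:
    for sale in csv.DictReader(f, delimiter="\t"):
        product = products[int(sale["productid"]) - 1]  # direct positional access
        total += float(sale["quantity"]) * float(product["price"])

print(total)
```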

Foreign key numberization is, in essence, a way to achieve pointerization on external storage. The strategy converts foreign key values in the fact table into sequence numbers, which is equivalent to creating pointers during an in-memory computation, and the pre-calculated result can be reused. Note that the foreign key values in the fact table must be re-matched whenever a major change happens to the dimension table; otherwise the correspondence will be wrong. In real-life practice, however, such re-matching is rare, because a dimension table is relatively stable and most changes to it are appends and modifications rather than deletions.
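The conversion itself can be a one-off pass, sketched below under the same assumed layout: each foreign key value in the fact table is replaced with the matching dimension record's sequence number, and the numberized copy is written back to external storage for every later JOIN to reuse:

```python
import csv

# One-off mapping from original primary key values to 1-based sequence numbers.
with open("products.txt", newline="") as f:
    seq_of = {row["id"]: i + 1
              for i, row in enumerate(csv.DictReader(f, delimiter="\t"))}

# Rewrite the fact table once; this is the external-storage counterpart of
# pre-calculating pointers, and it must be redone if the dimension table
# undergoes a major change.
with open("sales.txt", newline="") as fin, \
     open("sales_numberized.txt", "w", newline="") as fout:
    reader = csv.DictReader(fin, delimiter="\t")
    writer = csv.DictWriter(fout, fieldnames=reader.fieldnames, delimiter="\t")
    writer.writeheader()
    for sale in reader:
        sale["productid"] = seq_of[sale["productid"]]  # key value -> sequence number
        writer.writerow(sale)
```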

SQL is based on the concept of unordered sets, so SQL-based databases cannot exploit foreign key numberization as a shortcut for locating records. Even if the foreign key is deliberately numberized, the database cannot recognize that and still resorts to calculating and comparing hash values; hash-based search is all it can do.

But what if a dimension table is too large to fit in memory?

In the above algorithm, accesses to the fact table are continuous but accesses to the dimension table are not. As mentioned in our discussion of hard disk characteristics, hard disks respond slowly to random accesses, so keeping the dimension table in external storage and executing the above algorithm produces terrible performance, much poorer than that of the hash partitioning algorithm, or even than sorting and merging the two tables.

It is time to turn to cluster computing.

We prepare multiple machines, divide the large dimension table into segments, and distribute the segments across those machines to form a cluster dimension table. Then the above algorithm can still handle multiple foreign key joins in one traversal with parallel processing. The foreign key numberization strategy applies to a cluster dimension table, too.

Remember that memory handles accesses well no matter whether they are random or high-frequency, small-volume ones. Data in a cluster dimension table, though stored in memory, has to travel over the network, and network latency makes high-frequency, small-volume accesses expensive. So, while a single machine can retrieve and process one record at a time, with a cluster we should retrieve a batch of fact table records in each step of the traversal, collect the join key values, and send them to the nodes holding the corresponding segments to fetch the associated dimension field values, as the sketch below illustrates.
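A hedged Python sketch of that batched pattern follows; the segment size, the partition-by-consecutive-sequence-numbers rule, and the simulated nodes are all illustrative assumptions, and fetch_from_node stands in for what would be one network round trip per node in a real cluster:

```python
# Illustrative cluster setup: each "node" holds one contiguous segment of the
# numberized dimension table; here the nodes are simulated as local dicts.
SEGMENT_SIZE = 5
nodes = [
    {seq: {"price": float(seq)}
     for seq in range(n * SEGMENT_SIZE + 1, (n + 1) * SEGMENT_SIZE + 1)}
    for n in range(4)
]

def node_of(seq_no):
    # Route a sequence number to the node holding its segment.
    return (seq_no - 1) // SEGMENT_SIZE

def fetch_from_node(node_id, seq_nos):
    # Stands in for one network round trip: return the requested dimension
    # records from this node in a single batch.
    segment = nodes[node_id]
    return {seq: segment[seq] for seq in seq_nos}

def resolve_batch(fact_batch):
    # Collect the join keys of a whole batch, group them by target node, and
    # resolve each group with one request to amortize network latency.
    by_node = {}
    for sale in fact_batch:
        by_node.setdefault(node_of(sale["productid"]), set()).add(sale["productid"])
    records = {}
    for node_id, seq_nos in by_node.items():
        records.update(fetch_from_node(node_id, seq_nos))
    return records

batch = [{"productid": 3, "quantity": 2}, {"productid": 17, "quantity": 1}]
records = resolve_batch(batch)
amount = sum(s["quantity"] * records[s["productid"]]["price"] for s in batch)
```

Grouping the keys by node before fetching is what keeps the number of network round trips proportional to the node count rather than to the batch size.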

The key to high-performance JOINs is keeping the dimension tables in memory. Contemporary computers have large memories, so most of the time a dimension table can be wholly loaded into the memory of a single machine. On the few occasions when it is too large, we can create a cluster dimension table to keep random accesses fast. Only when even a cluster cannot hold the huge dimension table do we have to fall back on the slow hash partitioning algorithm.