Improving JOIN Performance: Foreign Key Pointerization

After redefining the JOIN operations, let’s find out how to increase their performance. We begin from the foreign-key-oriented joins.

Here are two tables:

The products table with the following fields:

id

name

price (per unit)

The sales table with the following fields:

seq

date

productid

quantity

The productid field in the sales table is the foreign key pointing to id field, the products table’s primary key.

To find out the amount of each sales record (to make the discussion simpler, we won’t specify any condition), we can express the query in SQL:

SELECT SUM(sales.quantity*products.price) FROM sales JOIN products ON sales.productid=products.id

With the Cartesian-product-based JOINs, the task can be only handled through a two-layer full traversal loop, which is computation-intensive. The common technique a relational database uses to optimize the query is hash partitioning. It calculates hash values of the associated fields in the two tables respectively, partition each table according to different hash values, and traverse each pair of partitions of same hash values in the two tables to compare records. This considerably reduces the complexity but a lot of hash value calculations and comparisons are needed. There are numerous discussions about the algorithm on the internet, so we won’t go into details.

Here’s the query written with the simplified JOIN syntax we introduced in the previous articles:

SELECT SUM(quantity*productid.price) FROM sales

This implies that a better optimization plan is possible.

Suppose all data can be loaded into the memory, and then we can optimize the query by pointerizing the foreign key.

By switching values of the foreign key field, productid, in the fact table sales to pointers referring to records of the dimension table products – that is, converting the productid values to records of the products table – we can directly reference a field of the products table for calculation.

It’s inconvenient to describe the detailed computing process in SQL, so we demonstrate it in procedural syntax with a file used as the data source:

1. P=file(“products.txt”).import() Import the products table P

2. P.index(id) Build an index on _P’_s primary key, the id field, to accelerate data search

3. S=file(“sales.txt”).import() Import the sales table S

4. S.switch(productid,P:id) Switch values of productid field in S to _P’_s records by matching the latter’s primary key

5. S.sum(quantity*productid.price) Calculate sales amount; since values of productid field have been converted into objects, we can directly reference the price field

In the above algorithm, step 2 creates an index on the primary key by using hashing method to calculate hash values for the id field. In step 4, the conversion into pointers involves calculating hash values for the productid field and matching them with _P’_s hash index. In terms of the amount of computations, the foreign key pointerization strategy and the conventional hash partitioning method are almost neck and neck in handling one JOIN.

If data can be wholly loaded into the memory, the advantage of creating pointers is that the pointers can be reused. This means we just need to do the calculation and comparison of hash values once. Later the results are ready to be used by a join operation on the same two associated fields, significantly boosting performance. The relational algebra doesn’t have the concept of object pointer, and Cartesian-product-based JOINs can’t ensure the uniqueness of the records in a table referred by the foreign key of another table. Under this circumstance, a foreign key cannot be pointerized and each join needs the calculation and comparison of hash values.

If there are multiple foreign keys in a fact table that refer to multiple dimension tables, the conventional hash partitioning strategy can only handle one JOIN at a time. For a number of (N) JOINs, each has to perform all the operations and needs to store its intermediate result for use in the next JOIN. The computing process is intricate and data will be traversed many times. The foreign key pointerization strategy, however, just traverses the fact table once without generating intermediate results, creating a clear and concise computation.

Memory is parallelism-friendly, but the hash partitioning join algorithm isn’t easily paralleled. Yes, we can segment tables to calculate hash values in parallel. But the strength of the parallel processing will be offset by the resource conflict that occurs when records with same hash values in two corresponding segments are being located and gathered for comparison. Under the foreign-key-oriented join model, two tables being joined are not equal by defining one as the fact table and the other as the dimension table, and we just need to segment the fact table to perform parallel processing.

By reforming it with the foreign key pointerization concept, the hash partitioning strategy’s ability of handling multiple foreign key joins and performing parallel processing can be increased. Some database products can do this kind of practical optimization. But when more than two tables are JOINed and when multiple JOIN types are involved, it isn’t easy for the database to judge which table should be identified as the fact table to be traversed and processed in parallel and which tables should be used as the dimension tables on which the hash indexes are built, making the optimization effect unguaranteed. So quadratic decline in performance often occurs when the number of tables being JOINed increases (performance significantly decreases when 4 or 5 tables are being JOINed but there isn’t significant increase in the size of the result set). By introducing the foreign key concept to the JOIN model, the foreign-key-oriented joins define fact table and dimension tables clearly. The increase of the numbers of tables to be JOINed will only lead to a linear decline in performance.

At present in-memory database technologies are hot. But our analysis shows that it’s almost impossible for the SQL-based in-memory databases to handle JOIN operations fast.