How to make JOIN run faster?
JOIN has always been a knotty and long-pending problem in the performance optimization of database. The performance of a query that was originally very fast will drop sharply once several JOINs are involved and, the larger and the more tables participating in the JOIN, the harder it is to improve performance.
In fact, the key to making JOIN run faster is to categorize the JOINs. After categorizing, we can make use of the characteristic of different types of JOINs to optimize the performance.
Categorize the JOINs
Those having SQL development experience all know that the vast majority of JOINs are equivalent in value, that is, the associated condition is an equation. The non-equivalent JOINs are much rarer, and in most cases, such JOINs can be converted to equivalent JOIN, so we only discuss the equivalent JOIN in this article.
Equivalent JOINs can be divided into two categories: foreign key association and primary key association.
First, let’s see the foreign key association, it refers to using the non-primary key fields of one table to associate the primary key of the other. In this context, we call the former table the fact table, and the latter table the dimension table. For example, in the figure below, the orders table is a fact table, and the customer table, product table and employee table are the dimension table.
The foreign key table is a many-to-one relationship, and is asymmetric. The positions of fact table and dimension table cannot be interchanged. It should be noted that the primary key mentioned here refers to the logical primary key, i.e., the field (or composite fields) in the table that has a unique value and can be used to uniquely determine a certain record, and the primary key is not necessarily created on the database table.
Let’s now move on to the primary key association. This association refers to using the primary key of one table to associate the primary key or part of primary keys of the other, such as the association between customer table and VIP customer table, and between orders table and details table in the figure below.
The two tables on the left are associated by respective primary key, and they are the homo-dimension table of each other. The orders table uses its primary key to associate part of primary keys of details table, we call the orders table the primary table and the details table the sub-table.
The homo-dimension table is a one-to-one relationship, and the two tables are symmetrical and have the same status. The primary-sub table is a one-to-many relationship, and they are asymmetric and has a clear direction.
If you are attentive, you will find that the two types of JOINs mentioned above involve the primary key. For JOINs that do not involve primary key, it will result in a many-to-many relationship, which has no business significance in most cases. In other words, the said two types of JOINs cover almost all JOINs that have business significance. If we can make use of this feature of JOIN that always involves primary key to optimize the performance, we will be able to deal with the two JOINs, it actually means that we are able to solve most of the performance problems of JOIN operation.
Unfortunately, SQL’s definition of JOIN does not involve the primary key, it only requires performing a Cartesian product operation on two tables, and then filtering by a certain condition. Although this definition is simple and broad, and can describe almost everything, if JOIN is implemented strictly according to this definition, there is no way to make use of the feature of primary key in performance optimization.
SPL changes the definition of JOIN. In SPL's definition, the two types of JOINs can be handled separately. By making use of the feature of primary key, the amount of operation could be reduced, and the goal of performance optimization could be achieved accordingly.
Let's see how SPL works in detail.
Foreign key association
If both the fact table and dimension table are not very large, and can be loaded into memory, SPL provides the foreign key addressization method: first convert the foreign key field value of fact table to the address of its corresponding dimension table record, and then fetch the value with this address when the dimension table field is referenced.
Let's take the above orders table and employee table as an example. Assume that both tables have been read into memory, the working mechanism of foreign key addressization will be as follows: for the eid field of a record (r) of orders table, first find the record corresponding to this eid field value in the employee table, and get the address (a) of the record in memory, and then replace r’s eid field value with a. Having converted all records in orders table in this way, the foreign key addressization is done. At this time, when the record (r) of orders table needs to reference the field of employee table, we can directly use the address (a) stored in the eid field to fetch the record and field of employee table, which is equivalent to obtaining the employee table field in constant time, and there is no need to search the employee table.
To achieve this, we could, at the system startup, read the fact table and dimension table into the memory and perform the foreign key addressization in one go, i.e., the pre-association. In this way, we can directly use the address in the foreign key field of fact table to fetch the record of dimension table in subsequent association calculation, thereby achieving high-performance JOIN calculation.
For the detailed principles of foreign key addressization and pre-association, please visit: Performance Optimization - 6.1 [Foreign key association] Foreign key addressization.
Usually, SQL uses the HASH algorithm to perform in-memory join, this method needs to calculate the HASH value and compare, and the performance will be much worse than that of direct reading from the address.
The reason why SPL can implement foreign key addressization is that it makes use of the feature that the association field of dimension table is the primary key. In the above example, the association field eid is the primary key of employee table, which is unique. Each eid in the orders table will only correspond to one employee record, so each eid can be converted to the address of the employee record that it uniquely corresponds to.
For SQL, however, since there is no agreement on the primary key in SQL’s definition of JOIN, it cannot determine that the record of dimension table associated with the foreign key of fact table is unique, and it may associate with multiple records. For the records of orders table, there is no way for the eid value to uniquely correspond to an employee record, so the foreign key addressization cannot be implemented. Moreover, SQL has no data type for recording address. As a result, the calculation of HASH value and comparison need to be performed in each association.
When only two tables are joined, the difference between foreign key addressization and HASH association is not very obvious. The reason is that JOIN is not the ultimate goal, and there will be many other operations after JOIN, and the proportion of time consumed by JOIN operation itself is relatively small. But in fact, the fact table often has multiple dimension tables, and even multi-layer dimension tables. For example, the order is associated with the product, the product is associated with the supplier, the supplier is associated with the city, and the city is associated with the country, and so on. When there are many association tables, the performance advantage of foreign key addressization will be more obvious.
The following test shows the performance difference between SPL and Oracle when the number of association tables is different, from which we can see that the advantage of foreign key addressization is quite obvious in the case of more tables.
For more information about the test, please visit: Performance optimization skill: Pre-Joining.
In the case that only the dimension table can be loaded into memory, and the fact table needs to be stored in external storage for its large size, SPL provides another method: foreign key sequence-numberization. This method works in a way that convert the foreign key field value of fact table to the sequence number of its corresponding record of dimension table in advance, and then read the new fact table records in batches during the association calculation, and finally use the sequence number to fetch the corresponding dimension table record.
Let’s take the above orders table and product table as an example, assuming that the product table has been loaded into memory, and the orders table is stored in external storage. The process of foreign key sequence-numberization: first read a batch of order data, and let the pid of one of the records (r) correspond to the i-th record of product table in memory, and then convert the pid field value in r to i. After this batch of order records are converted in this way, read the order data in batches from the external storage when performing the association calculation. For the record r, we can directly take out the corresponding record from the product table in the memory according to the pid value. This method also avoids the search action.
For more information about the principle of foreign key sequence-numberization, please visit: Performance Optimization - 6.3 [Foreign key association] Foreign key sequence-numberization.
The database usually reads the small table into memory, and then reads the data of large table in batches, and finally uses the hash algorithm to perform in-memory join operation. Therefore, it needs to calculate the hash value and compare. As for SPL, it uses the sequence number positioning to read directly, and there is no need to perform any comparison, and hence the performance advantage is obvious. Although it takes a certain cost to convert the foreign key field of fact table to sequence numbers in advance, this pre-calculation only needs to be done once and the pre-calculated result can be reused in multiple foreign key associations.
The foreign key addressization of SPL also makes use of this feature that the association field of dimension table is the primary key. As mentioned earlier, since there is no agreement on the primary key in the definition of JOIN in SQL, it cannot use this feature to perform foreign key sequence-numberization. In addition, SQL uses the concept of unordered set, even if we performed the foreign key sequence-numberization in advance, the database could not make use of this feature because it cannot use the sequence number fast positioning mechanism on the unordered set, and the fastest way is to use the index to search. Moreover, the database does not know that the foreign key has been sequence-numberized, and will still calculate the hash value and compare.
The following test compares the speed of SPL and Oracle in performing the association calculation between large fact table and small dimension table under different number of parallel calculations. The result shows that SPL runs 3 to 8 times faster than Oracle. See the figure for test results:
For more information about the test, please visit: Performance optimization skill: Numberizing Foreign Key.
If the dimension table is also large and needs to be stored in external storage, while the fact table is small and can be loaded into memory, SPL provides the big dimension table search mechanism. If both the dimension table and fact table are large, SPL uses the one-side partitioning algorithm. For the case where dimension table is filtered and then associated, SPL provides methods such as index reuse and aligned sequence.
When the amount of data is so large that distributed computing is required, if the dimension table is small, SPL adopts the mechanism of duplicate dimension table to copy the dimension table in multiple copies on the cluster nodes; if the dimension table is large, the cluster dimension table method is used to ensure random access. Both methods can effectively avoid the Shuffle action. In contrast, the dimension table cannot be distinguished under the SQL system, and the HASH split method requires that the Shuffle action needs to be performed on both tables, and hence the network transmission volume is much larger.
Primary key association
Tables involved in the primary key association are generally relatively large and need to be stored in the external storage. In view of this situation, SPL provides the ordered merge algorithm: pre-store the tables in external storage in order by primary key, and take out the data in sequence to perform merge calculation when associating.
Let's take the inner join of the customer table and VIP customer table as an example, and assume that the two tables have been stored in external storage orderly by the primary key cid. When associating, read the records from the cursors of two tables and compare the cid values one by one. If the cid values are equal, merge the records of two tables into one record of result cursor and return, if not, read the record of the cursor with the smaller cid value again and continue to compare. Repeat these actions until the data of any table are completely fetched, at which time, the returned cursor is the result of JOIN.
For the association of two large tables, the database generally adopts the hashing & partitioning algorithm, the complexity of which is quadratic. In contrast, the complexity of ordered merge algorithm is linear, so the performance of the latter algorithm will be much better. Moreover, when the database performs the operation of big data in external storage, the hashing & partitioning algorithm will lead to reading and writing the buffer files, while the ordered merge algorithm only needs to traverse the two tables in turn, without the need to buffer data on external storage, therefore, it can greatly reduce the amount of IO and has significant performance advantages.
Although the cost of sorting by primary key in advance is high, it can be done in one go and, after that, the merge algorithm can always be used to implement JOIN. In this way, the performance can be greatly improved. Furthermore, SPL also provides a scheme to keep the data as a whole in order even when performing append action.
The feature of this type of JOIN is that the association field is the primary key or part of primary keys, the ordered merge algorithm is designed exactly based on this feature. Because no matter whether it is the homo-dimension table or primary-sub table, the association field will always be the primary key rather than other fields, so all we need to do is to sort and store the association tables in order by primary key, and there will never be redundancy. The foreign key association, however, does not have this feature, and hence the ordered merge algorithm will fail. Specifically, since the association field of fact table is not the primary key, there will be multiple foreign key fields to participate in the association, it is impossible to make the same fact table sorted orderly by multiple fields at the same time.
The definition of JOIN in SQL does not distinguish the type of JOINs, and does not assume that a certain JOIN is always for primary key, so there is no way to make use of the feature of primary key association at the algorithm level. Moreover, as mentioned earlier, SQL is based on the concept of unordered set and, since the database will not deliberately ensure the physical order of data, it is difficult to perform the ordered merge algorithm.
Another advantage of ordered merge algorithm is that it is easy to perform the parallel computing in segment. Let's take the association of orders table and details table by oid as an example. If each of the tables is roughly equally divided into 4 segments according to the number of records, the oid in the second segment of orders table may appear in the third segment of details table, and such misalignment will lead to incorrect calculation result. To solve this problem, SPL once again makes use of the ordered characteristics of primary key oid, and provides a synchronous segmentation mechanism, which works in a way that first divide the ordered orders table into 4 segments, and then find the oid values of the start and end records of each segment to form 4 intervals, and finally divide the details table into 4 synchronized segments. In this way, the segments of two tables correspond to from each other, and there will be no misalignment during parallel computing. Since the details table is also ordered by oid, a quick location can be made according to the start and end oids without reducing the performance of ordered merge.
For the principal of ordered merging, and synchronous parallel computing in segment, visit: SPL Order-based Merge Join.
If we use the conventional HASH partitioning technology to implement parallel computing, it is relatively difficult, for the reason that when the multi-thread HASH partitioning is performed, these threads need to write data to a certain partition at the same time, resulting in the conflict of shared resource; moreover, when the association of a certain group of partitions is performed in the next step, a large amount of memory space will be consumed, making it impossible to perform more parallel computing.
An actual test for the primary key association of two large tables (visit: Performance optimization skill: Ordered MERGE for details), proves that under the same condition, SPL is nearly 3 times faster than Oracle:
Besides the ordered merge algorithm, SPL also provides more high-performance algorithms to comprehensively improve the calculation speed of primary key association JOIN. For example, the attached table mechanism, this mechanism can store multiple tables in an integrated manner, which not only reduces the data storage amount, but also is equivalent to performing the pre-association and avoids the comparison; the association location algorithm, is to filter first and then associate, which can avoid full table traversal, and obtain better performance, and so on.
When a cluster with multiple servers is required to cope with the continuous increase of data amount, SPL provides the multi-zone composite table mechanism. This mechanism distributes the large tables that need to be associated to the cluster nodes according to the primary key. Since the data with the same primary key is on the same node, the data transmission between nodes, and the shuffle action are avoided.
Review and summary
Reviewing the above two types of JOINs performed in various scenarios, we can conclude that using the high-performance algorithms in SPL according to specific situation can use the feature of each JOIN to speed up and make JOIN run faster. On the contrary, for so many JOIN scenarios mentioned above, SQL can only deal with them in a general way, so there is no way to implement these high-performance algorithms for different JOIN features. For example: when both the fact table and dimension table are loaded into the memory, SQL can only calculate the HASH values according to the key value and compare, and cannot use the address to directly correspond; since the data table of SQL is unordered, the ordered merge cannot be performed when large tables are associated by primary key, and only the HASH partitioning method can be used, and hence multiple buffering may occur, and the performance is uncontrollable to a certain extent.
In terms of parallel computing, although it is easy for SQL to implement parallel computing in segment for single-table computing, it generally has to perform the fixed segmentation in advance in multi-table association operation, and hence it is difficult to achieve synchronous dynamic segmentation, and this means that it is difficult to temporarily decide the number of parallel computing based on the workload of machine.
The is also the case for cluster operation. Since SQL does not distinguish between dimension table and fact table in theory, it will inevitably generate the HASH Shuffle action that takes up a lot of network resources when implementing the JOIN between large tables. Once there are too many cluster nodes, the delay caused by network transmission will exceed the benefits of more nodes.
SPL designs and uses new operation and storage models and can solve these problems of SQL in principle and implementation. For different types of JOIN and scenarios, programmers can use the above high-performance algorithms in a targeted manner to obtain faster computing speed and make JOIN run faster.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL