SQL Performance Enhancement: Big Table Associations on the Primary Key
The essay analyzes the fundamental reason behind slow SQL statements and provides the solution to make improvements through code samples. Looking SQL Performance Enhancement: Big Table Associations on the Primary Key for details.
Problem description
It is common to associate two tables, like an orders table and an orders details table, according to the primary key. Regardless of the related field, databases will use the HASH JOIN to achieve an association. Yet the algorithm is very slow when associating two big tables because it is, in essence, a kind of in-memory algorithm. When there is a huge volume of data, it needs to be first divided into heaps and buffered before a join between big tables can be converted into joins between a number of small tables. This causes high I/O cost, and repeated buffering if the HASH function is not appropriately selected.
Solution
Ⅰ. Order-based MERGE algorithm
If both of the big tables are ordered by the primary key, we can achieve the join using order-based MERGE. The algorithm traverses both tables in order without any buffering, which considerably reduces I/O load. In terms of complexity, the order-based MERGE is of additive level and HASH JOIN is on a multiplicative level.
If both tables are unordered, they need to be pre-sorted by the primary key. Sorting is resource-consuming and compromises performance, making the overall performance even worse than the conventional algorithm. Luckily, join operations by the primary key (or part of the primary key) are frequently seen. The order-based MERGE algorithm is widely applicable as it can handle such a commonly seen type of computing scenario. Actually, all meaningful joins involve the primary key (we’ll talk about this separately).
Pre-sorting by the primary key is a slow yet one-off. We just need to store the sorting result only, without any redundant data.
Ⅱ. Order-based appending
The newly-added data isn’t always ordered by the primary key. It should not be directly appended at the end of the already ordered data. And it will be time-consuming if we combine the new data with the existing data and do a full sorting.
Instead, we can sort the newly-added data first and generate a new ordered table based on it and the old ordered data using the cheap order-based MERGE algorithm. Its overall degree of complexity amounts to reading and writing the whole data once. The method can avoid the frequent temporary external buffering that happens in the ordinary full sorting and thus get higher performance.
Further, we can keep a small-scale ordered table (referred to as patch data below), which will merge with the sorted newly-added data while keeping the old ordered data unchanged. Over time when the patch data gets accumulated, it will be merged with the old ordered data. A join operation retrieves both the old ordered data and the patch data, merges them, and join the result with another table. The performance decreases a little compared with handling one table of ordered data, but the data orderliness can still make the computation much faster.
The time when to merge the patch data with the old ordered data is related to the cycles when data should be added. If data is added each day, we can do the merge each month. The patch data stores data in one month at most and the existing ordered data contains all data one month ago. The patch data could be far smaller than the old ordered data, so the daily size of to-be-merged data is small and the data appending is fast. As we do the whole-data order-based merge once a month, it is acceptable that it takes a little long to complete.
Code sample
Step 1. Data pre-processing – sort data and store the ordered result set.
Here we only pre-processes table A (the pre-processing code for table B is similar).
A |
|
1 |
=file("A-original.ctx") |
2 |
=A1.open().cursor().sortx(a).new(a,…) |
3 |
=file("A.ctx").create(#a,…) |
4 |
=A3.append@i(A2) |
A1: Define the original composite table A-original.ctx.
A2: Open the composite table, create a cursor based on it, sort the cursor by field a, and return a new cursor where field a is in the leading position.
A3: Create a new composite table A.ctx, where field a is preceded by the pound sign #. This means the table is ordered by that field.
A4: Write the ordered data into composite table A.ctx.
Step 2. Ordered-merge and join.
A |
|
1 |
=file("A.ctx").open().cursor(a,…) |
2 |
=file("B.ctx").open().cursor(b,…) |
3 |
=joinx(A1:A,a;A2:B,b) |
4 |
… |
A1,A2: Open composite tables A.ctx and B.ctx, and create cursors based on them.
A3: Perform order-based merge to inner-join two cursors by field a and field b, and return a cursor. Similarly, joinx@f performs a full join and joinx@1 performs a left join.
A4: Perform other operations on the result cursor returned by the order-based merge.
Step 3. New data appending.
Suppose the newly-added data for table A is stored in A_new.btx and has same field names in the same order as A.ctx has.
A |
B |
|
1 |
=file("A.ctx").open() |
|
2 |
=file("A_new.btx").cursor@b().sortx(a) |
|
3 |
if (day(now())==1) |
>A1.reset() |
4 |
>A1.append@a(A2) |
A1: Open the composite table A.ctx.
A2: Define cursor based on A_new.btx and sort it. As the daily volume of new data is small, the sorting is a fast, in-memory operation though sortx function is used.
A3: Check whether the current date is day 1, and execute A4, if it isn’t, to merge the record with the patch data using append@a; or execute B3, if it is, to merge the existing ordered data with the patch data using reset function.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL