Performance Optimization - 7.1 [Merge and join] Ordered merge


We have mentioned the ordered merge algorithm many times, for example, this algorithm was used when we introduced the appending of ordered composite table in Chapter 2. It can be used to achieve the intersection, union and difference operations for large sets. Taking union as an example, the logic of this algorithm is roughly as follows:

1 func =file(“A.btx”).cursor@b() =file(“B.btx”).cursor@b()
2 =C1.fetch(1)) =D1.fetch(1)
3 for if C2==null && D2==null break
4 else if C2==null >r=D2,D2=D1.fetch(1)
5 else if D2==null >r=C2,C2=C1.fetch(1)
6 else if< >r=C2,C2=C1.fetch(1)
7 else if> >r=D2,D2=D1.fetch(1)
8 else >r=C2,C2=C1.fetch(1),D2=D1.fetch(1)
9 return r
10 return cursor@c(A1)

In this code, both the data tables A and B are ordered by the id field. It will take a record from the cursors C1 and D1 respectively and compare which record has a smaller id, and then save this record having a smaller id for returning, and continue to read the cursor to which this record belongs. This action will be repeatedly performed until all the data of both cursors are all taken out. In this way, the returned cursor is the union of two data sets based on the id field.

To get the intersection, it can be changed to return the data only when both records are equal, and the loop ends when reading one cursor ends. Getting the difference can be implemented similarly. This algorithm can also be extended to the situation with n tables, in that case, the code will be more complex.

The ordered merge algorithm can also be used to perform the join operation. Take the inner join as an example, the logic of this algorithm is as follows:

1 func =file(“A.btx”).cursor@b() =file(“B.btx”).cursor@b()
2 =C1.fetch(1)) =D1.fetch(1)
3 for if C2==null || D2==null break
4 else if< >C2=C1.fetch(1)
5 else if> >D2=D1.fetch(1)
6 Else =new(C2,D2)
7 >r,C2=C1.fetch(1),D2=D1.fetch(1)
8 return D6
9 return cursor@c(A1)

This logic is very similar to the intersection operation. The full join or left join can be achieved in a similar way, in this case, we only need to adjust the intermediate judgment code. Likewise, this logic can also be extended to the situation with n tables.

The ordered merge algorithm can be finished by only traversing both tables once. Even for large data tables, there is no need to buffer the data. If the size of two tables is N and M respectively, the complexity will be O(N+M). To perform set or join operation on unordered data, we need to compare every piece of data each other if no optimization is taken. The complexity will be O(N*M) which is much larger than O(N+M). Usually, the database uses the hashing & partitioning algorithm described in the previous chapter. Although the complexity of this algorithm can decrease by K times (K is the average number of repetitions of the hash value), it is generally still much greater than the ordered merge algorithm. Moreover, for the operation of big data in external storage, the hashing & partitioning method will also lead to read and write buffer files, and may encounter the unlucky situation we have mentioned many times. Therefore, there will be huge performance advantages by using the ordered merge algorithm.

SPL implements the ordered merge algorithm involving the set and join operations:

1 =file(“A.btx”).cursor@b()
2 =file(“B.btx”).cursor@b()
3 =[A1,A2].mergex@u(id)
3 =[A1,A2].mergex@i(id)
3 =[A1,A2].mergex@d(id)

The options @u, @i and @d of the mergex() function represent the union, intersection and difference respectively.

1 =file(“A.btx”).cursor@b()
2 =file(“B.btx”).cursor@b()
3 =joinx(A1,id;A2,id)
3 =joinx@1(A1,id;A2,id)
3 =joinx@f(A1,id;A2,id)

The joinx() function without option, with options @1 and @f represent the inner join, left join and full join respectively.

Both mergex()and joinx() assume that the cursor, as the parameter, is already ordered. Executing these two functions for an unordered cursor will result in incorrect calculation results.

Both functions are delayed cursors, and further calculations need to be defined.

In addition to the foreign key association, there are also two types of associations in common join operation, i.e., homo-dimension tables association, and primary-sub table association. The former is the one-to-one association, and the two tables are associated by the primary key; while the latter is the one-to-many association, and the association occurs between the primary key of primary table with the previous part of primary keys of sub-table. Both the two tables in these two associations are related to the primary key. If the data is ordered by the primary key, the low-complexity ordered merge algorithm can be adopted. When the large dimension table is ordered by the primary key, the foreign key association algorithm introduced in the previous chapter can also achieve high performance.

For the database, being ordered by the primary key is the key to achieve high-performance join operation. It is enough for the database to be ordered by the primary key only, and the cost of meeting this premise is not very high. In SPL, we will assume that all the data tables in the external storage are by default ordered by the primary key, and only few data tables used for special search targets will be sorted differently.

The definition of join operation in relational algebra is too simple (Cartesian product and then filtering) and does not involve the primary key. If the definition is strictly implemented, the ordering characteristics of association keys cannot be assumed in theory, hence these optimization algorithms cannot be used. Moreover, according to the theory of relational algebra, it is difficult to ensure that the data table is ordered by any field, and even if we want to improve the performance through the engineering optimization, it is also difficult to reach a relatively high level.