Performance Optimization - 7.2 [Merge and join] Merge in segments


There is an inconvenience in the ordered merge algorithm for big data, that is, the data need to be read one by one from two (or more) cursors before comparison. Such logic cannot directly achieve the parallel computing in segments. Since this logic cannot ensure that the associated key values of two tables are distributed synchronously in two corresponding segments, the key value in the second segment of table A may correspond to the third segment of table B.

We can once again use the ordering characteristic of the primary key to solve this problem.

When the tables A and B are associated by the primary key, take A as the reference table to divide it into segments. After that, we can quickly take out the start and end values of the primary key of each segment, and then use the primary key interval of each segment as a condition to search in table B. Since the table B is also ordered by primary key, and the records in the interval into which the primary key value falls are stored continuously, these records can also be quickly located to generate the cursor. In this way, we can achieve the synchronous segmentation and the parallel computing in segments:

1 =file(“A.ctx”).open() =file(“B.ctx”).open()
2 =[-inf()] | 9.(A1.cursor(;;~+1:10).fetch(1).id)| [inf()]
3 =10.(“id>=”/A2(#)/"&&"id<"A2(#+1))
4 fork to(10) =A1.cursor(;;A4:10)
5 =B1.cursor(;${A3(A4)})
6 =joinx(B4,id;B5,id)
7 =…
8 =A4.conj().…

In A2, the id value of the first record of each segment in table A will be firstly read, and A3 will use these segmentation points to form the interval condition. After that, in each thread, table A is still segmented normally, and table B is segmented with corresponding condition. In this way, it can ensure that the key values in two tables are synchronously corresponding, and the calculation will continue after the associated cursor of segment cursor is obtained.

SPL provides corresponding function with which associable multi-cursors can be generated:

1 =file(“A.ctx”).open() =file(“B.ctx”).open()
2 =A1.cursor@m(;;4) =B1.cursor(;;A3)
3 =joinx(A2,id;B2,id)

In A2, the multi-cursor of table A is generated normally, and in B2, the multi-cursor of table B can be generated based on A2. Because the primary key (the name of field with # while executing the create() can be defined when the composite table is created, there is no need to explicate the field name here. Instead, the primary keys of the two tables will be directly corresponded for searching (the primary key field names of two tables can be different). The joinx() in A3 will also return a multi-cursor.

SPL also supports such aligned synchronization segmentation mechanism for the ordered bin files and text files, but the associated fields cannot be omitted. Since the working principle is the same, examples will not be given here.

There is another problem.

Since the primary key is unique, it is impossible to assign two records with the same primary key to two segments while segmenting. However, in case of the primary-sub table association, the association keys of the sub-table are not all primary keys, and there may be duplicate values, hence the records with the same association key may appear in two natural segments. As a result, when the parallel segmentation is to be performed in case of the primary-sub table association, the primary table should be used as the reference table, the sub-table should follow the primary table to do segmentation instead of reversing the order at will.