Performance Optimization - 7.2 [Merge and join] Merge in segments

 

Performance Optimization - 7.1 [Merge and join] Ordered merge

There is an inconvenience in the ordered merge algorithm for big data, that is, the data need to be read one by one from two (or more) cursors before comparison. Such logic cannot directly achieve the parallel computing in segments. Since this logic cannot ensure that the associated key values of two tables are distributed synchronously in two corresponding segments, the key value in the second segment of table A may correspond to the third segment of table B.

We can once again use the ordering characteristic of the primary key to solve this problem.

When the tables A and B are associated by the primary key, take A as the reference table to divide it into segments. After that, we can quickly take out the start and end values of the primary key of each segment, and then use the primary key interval of each segment as a condition to search in table B. Since the table B is also ordered by primary key, and the records in the interval into which the primary key value falls are stored continuously, these records can also be quickly located to generate the cursor. In this way, we can achieve the synchronous segmentation and the parallel computing in segments:

A B
1 =file(“A.ctx”).open() =file(“B.ctx”).open()
2 =[-inf()] | 9.(A1.cursor(;;~+1:10).fetch(1).id)| [inf()]
3 =10.(“id>=”/A2(#)/"&&"id<"A2(#+1))
4 fork to(10) =A1.cursor(;;A4:10)
5 =B1.cursor(;${A3(A4)})
6 =joinx(B4,id;B5,id)
7 =…
8 =A4.conj().…

In A2, the id value of the first record of each segment in table A will be firstly read, and A3 will use these segmentation points to form the interval condition. After that, in each thread, table A is still segmented normally, and table B is segmented with corresponding condition. In this way, it can ensure that the key values in two tables are synchronously corresponding, and the calculation will continue after the associated cursor of segment cursor is obtained.

SPL provides corresponding function with which associable multi-cursors can be generated:

A B
1 =file(“A.ctx”).open() =file(“B.ctx”).open()
2 =A1.cursor@m(;;4) =B1.cursor(;;A2)
3 =joinx(A2,id;B2,id)

In A2, the multi-cursor of table A is generated normally, and in B2, the multi-cursor of table B can be generated based on A2. Since the primary key (field name with # while executing the create()) can be defined when creating the composite table, there is no need to explicate the field name here. Instead, search will be directly performed according to corresponding primary key of the two tables (different names are allowed for the primary key field of two tables). The joinx() in A3 will also return a multi-cursor.

There is another problem.

Since the primary key is unique, it is impossible to assign two records with the same primary key to two segments while segmenting. However, in case of the primary-sub table association, the association keys of the sub-table are not all primary keys, and there may be duplicate values, hence the records with the same association key may appear in two natural segments. As a result, when the parallel segmentation is to be performed in case of the primary-sub table association, the primary table should be used as the reference table, the sub-table should follow the primary table to do segmentation instead of reversing the order at will.


Performance Optimization - 7.3 [Merge and join] Association location
Performance Optimization - Preface