Performance Optimization - 7.3 [Merge and join] Association location

 

What we discussed in the previous two sections is the full-table traversal. In real tasks, the association tables are often filtered with conditions. Of course, we can filter after the association, but it will traverse all the association tables, and will take a lot of time in case of large table. Sometimes the filter can be performed quickly, and the remaining result set after filtering may be very small. The two tables are associated by the primary key, hence using the primary key of the filtered table to search the records of association table can avoid full-table traversal and obtain better performance.

The logic of this algorithm is somewhat like the foreign key association between a small fact table and a large dimension table. However, we can take advantage of the characteristic that both the primary keys of two tables are ordered to deal with the larger filtered result set.

A B
1 =file(“A.ctx”).open() =file(“B.ctx”).open()
2 =A1.cursor(;…)
3 for A2,1000 =B1.find(A3.(id))
4 =join(A3,id;B3,id)
5

In B3, the primary key from A3 is used to take out data. Since the primary key of the records taken out from A3 is ordered, there will be no duplicate record taken out from B1. Moreover, table B is always scanned from front to back, which will be at worst traversed only once. As a result, the repeated traversals that may be caused by adopting this method that we assumed while discussing the large fact table and large dimension table will not occur.

However, this code is only an example to describe the algorithm process. In fact, the find()function cannot be used in such a simply way because it will search from the beginning every time, and will not take advantage of the results of the last search (it only needs to search forward from the last location point). This algorithm that uses the association to locate cannot be implemented easily, and its code is more complicated. Moreover, the cost of this algorithm itself is not low, which may offset the advantage of smaller reading amount. For the specific application scenario, testing is necessary.

SPL encapsulates these operations. Let’s take the primary-sub table as an example.

A
1 =file(“orders.ctx”).open().cursor(dt;area==“CA”)
2 =file(“details.ctx”).open().news(A1,dt,price,quantity)
3 =A2.groups(dt;sum(price*quantity))

After the primary table is filtered, use the filtered primary table to take out the records of associated sub-table. Because the primary table has a one-to-many relationship with sub-table, the news()should be used here to indicate that each record taken from the primary table may correspond to multiple sub-table records. At this time, the field of primary table will be duplicated according to the number of the records of associated sub-table, this action is equivalent to executing joinx().

Alternately, the sub-table can be filtered first, and then take out the records of the associated primary table.

A
1 =file(“details.ctx”).open().cursor(price,quantity;quanity>10)
2 =file(“orders.ctx”).open().new(A1,dt,sum(price*quantity):amount)
3 =A2.groups(dt;sum(amount))

By default, the returned cursor is based on the latter association table (if there is no any filter condition, the number of records in the returned cursor is the same as that in the associated table), in this code, the latter table is the primary table, and the records of the sub-table need to be aggregated first.

These functions can be applied to multi-cursor.