Performance Optimization - 9.3 [Cluster] Duplicate dimension table

 

Let’s discuss the foreign key association, starting with the small dimension table, that is, the situation where the dimension table can be loaded into node’s memory.

So, which node’s memory should be used to hold the dimension table?

All nodes’ memory should hold one copy.

Since the fact table under a cluster is very large and need all nodes to store, while the dimension table record will be accessed randomly, and hence the fact table on any node may associate with all dimension tables’ records. If the dimension table was only stored on a certain node, it would cause a large amount of network transmission. When the dimension table is small enough, it can be loaded on every node, in this way, the association operation becomes a local operation.

This kind of dimension table that has been duplicated in multiple copies is called the duplicate dimension table.

Since the small dimension table is not large, there is no need to store it in multiple zones. Instead, it can be duplicated and stored in the external storage of every node, and loaded on every node.

A B
1 [“192.168.0.101:8281”,“192.168.0.102:8281”,…, “192.168.0.104:8281”]
2 fork to(4);A1 =file(“product.ctx”).open().import()
3 >env(PRODUCT,B2)
4 =memory(A1,PRODUCT)
5 =file(“orders.ctx”:to(4),A1).open().cursor(p_id,quantity)
6 =A5.switch(p_id,A4)
7 =A7.groups(p_id.vendor;sum(p_id.price*quantity))

First, use the fork framework to load the dimension table on every node and name it as a certain global variable. Then, on the master computer, use the memory()function to create a duplicate dimension table based on the global variables of all nodes. The subsequent operations on the cluster table are basically the same as those on a single machine. For a certain node, the duplicate dimension table is exactly the table sequence in local memory, which can be used, like the dimension table on a single machine, in the functions such as switch() and join() to achieve the addressization. When SPL processes the operation on each node, it will search this dimension table locally for association.

The dimension tables are often reused in multiple computing tasks, and the fork code block can be executed when the node is started, and the program of master computer could just start from the establishment of the duplicate dimension tables. For small multi-layer dimension table, the pre-association can be performed on the node in advance.

By distinguishing the dimension table and the fact table, and by means of the smaller feature of dimension table, we can load the dimension table on every node in advance. Such foreign key association will not generate network transmission.

Database does not distinguish between dimension table and fact table; it generally determines whether to copy the table to the node based on the size of tables. For the situation of associating two tables, there is little difference from the above-mentioned scheme if the database is well-optimized. But when there are more tables or more complex association layer, the database may be “perplexed”. Some distributed databases that are not optimized well will store every table in partitions, resulting in a poorer association performance.