Performance Optimization - 9.3 [Cluster] Duplicate dimension table
Performance Optimization - 9.2 [Cluster] Multi-zone composite table of cluster
Let’s discuss foreign key association, starting with the situation of small dimension table, that is, the situation where the dimension table can be loaded into node’s memory.
So, which node’s memory should be used to hold the dimension table?
All node’s memory should hold one copy.
Since the fact table under a cluster is very large, it would be stored on multi nodes. The records of the dimension table will be accessed randomly, so the fact table on any node may be associated with all dimension table records. If the dimension table is stored fixedly on a certain node, it will result in large amount of network transmission. When the dimension table is small enough, it can be loaded on every node, so that the association operation becomes a local operation.
This dimension table that is copied multiple times is called duplicated dimension table.
Since the small dimension table is not large, there is no need to store it in multiple zones. Instead, it can be duplicated and stored in the external storage of every node and loaded on every node.
A | B | |
---|---|---|
1 | [“192.168.0.101:8281”,“192.168.0.102:8281”,…, “192.168.0.104:8281”] | |
2 | fork to(4);A1 | =file(“product.ctx”).open().import() |
3 | >env(PRODUCT,B2) | |
4 | =memory(A1,PRODUCT) | |
5 | =file(“orders.ctx”:to(4),A1).open().cursor(p_id,quantity) | |
6 | =A5.switch(p_id,A4) | |
7 | =A7.groups(p_id.vendor;sum(p_id.price*quantity)) |
First, use the fork framework to load the dimension table on every node and name it as a global variable. Then, on the master computer, use the memory()function to create a duplicated dimension table based on the global variables of all nodes. The subsequent operations on the cluster table are basically the same as those on a single machine. For a certain node, the duplicated dimension table is the table sequence in local memory and can be used in the functions such as switch() and join() to implement addressization, just like the dimension table on a single machine. When SPL handles the operation on each node, it will search for this dimension table locally for association.
Since the dimension tables are often reused in multiple computing tasks, the fork code block can be executed when the node is started, while the program of master computer only needs to be executed from the establishment of the duplicated dimension tables. For smaller multi-layer dimension tables, pre-association can be performed on the node in advance.
By distinguishing between dimension table and fact table, and taking advantage of the characteristic that the dimension table is smaller in size to load it on every node in advance, the foreign key association will not generate network transmission.
Database does not distinguish between dimension table and fact table, and generally decides whether to copy to a node based on the size of table. For the situation of associating two tables, there is little difference from the above-mentioned scheme if the database is well-optimized. But when there are more tables or the association layer is more complex, the database may become “perplexed”. Some distributed databases that are not well optimized will store every table in partitions, resulting in poor association performance.
Performance Optimization - 9.4 [Cluster] Segmented dimension table
Performance Optimization - Preface
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL