Performance Optimization - 9.1 [Cluster] Computation and data distribution

 


When the amount of data is large, we can use multiple machines, i.e., a cluster, to share the computing tasks. In a cluster, the machines participating in the computation are called node machines (nodes for short). Usually, there is also a control program responsible for managing the computing tasks, assigning them to the nodes and aggregating the calculation results; it is called the master computer.

A node is generally a physical machine, because logical machines running on the same hardware cannot really share the computing load. The master computer is mainly responsible for control and does not actually undertake a large amount of computation, so it may well be a logical machine, acted by a thread on one of the nodes or by an application outside the cluster.

Multi-machine parallel computing is similar to multi-thread parallel computing on a single machine, except that the tasks are assigned to nodes instead of threads.
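
As a reminder of the single-machine pattern, the grouping used in this chapter could be written roughly as follows with a multi-thread fork. This is only a sketch, assuming a local orders.ctx composite table holding the whole year's data, with 4 threads each traversing one segment:

	A	B
1	fork to(4)	=file("orders.ctx").open().cursor(area,amount;;A1:4)
2	 	return B1.groups(area;sum(amount):amount)
3	=A1.conj().groups(area;sum(amount))	

Comparing this with the cluster code below, the structure is the same; the visible difference is that fork is given a list of node addresses, so the code block is executed on the nodes rather than in local threads.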

The fork statement of SPL also supports multi-machine parallel computing, but the esProc server must be started on each node and the access addresses must be configured in advance.

	A	B
1	["192.168.0.101:8281","192.168.0.102:8281",…,"192.168.0.104:8281"]	
2	fork to(4);A1	=file("orders.ctx").open().cursor(area,amount)
3	 	return B2.groups(area;sum(amount):amount)
4	=A2.conj().groups(area;sum(amount))	

Since the order table for one year is very large, we divide it into four parts by quarter and store them on the four nodes respectively, each part under the same file name (with different contents). A1 contains the access addresses of the four nodes (below, the last section of an address is used to refer to the node). The fork statement in A2 sends its code block to the esProc server on each node for execution. When a node finishes executing, its return value is transmitted to the master computer. Once the master computer has collected the return values from all the nodes, it continues to execute until the final aggregation is finished.

Multi-cursor parallel computing can also be used on the nodes:

	A	B
1	["192.168.0.101:8281","192.168.0.102:8281",…,"192.168.0.104:8281"]	
2	fork to(4);A1	=file("orders.ctx").open().cursor@m(area,amount;;4)
3	 	return B2.groups(area;sum(amount):amount)
4	=A2.conj().groups(area;sum(amount))	

This allows multiple machines to share computing tasks and improve computing performance.

In general, the physical configurations of the nodes in a cluster are the same, and so their computing capabilities are basically the same. In the example above, dividing one year's data into 4 parts by quarter and placing them on the 4 nodes can be considered a relatively even distribution. In this way, the amount of data traversed by each node is almost the same, and the situation where some nodes have to wait because the computing load on a certain node is too heavy will not occur.

If we change the requirement to: first filter the order table by a specified time period, and then group and summarize, the situation will be different.
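
For reference, the changed requirement might be coded roughly like this. This is only a sketch, in which start and end are hypothetical date parameters and odate is an assumed order-date field of the composite table:

	A	B
1	["192.168.0.101:8281","192.168.0.102:8281",…,"192.168.0.104:8281"]	
2	fork to(4);A1	=file("orders.ctx").open().cursor(area,amount;odate>=start && odate<=end)
3	 	return B2.groups(area;sum(amount):amount)
4	=A2.conj().groups(area;sum(amount))	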

Assume that the time period used as the filter condition falls within the first quarter, and that the data is ordered by date (which is often the case), so the records that meet the condition can be quickly located by using the ordered characteristic of the data. As a result, nodes 102, 103 and 104 will quickly find that they have nothing to do, and almost the whole task falls on 101.

This problem can be avoided if the snake distribution method is adopted. Assuming that the amount of data per day does not differ much, we use the remainder of the day number of the date divided by 4 to determine which node a record is distributed to; that is, the data of day 1, 5, 9, … is placed on 101, the data of day 2, 6, … on 102, …, and the data of day 4, 8, … on 104. Now, if we perform the calculation according to the above requirement, no serious task imbalance will occur.
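
As a rough sketch of how such a split could be prepared: the source file orders.txt, its fields, the base date 2021-01-01 and the target file names ordersN.ctx are all assumptions here, and each ordersN.ctx would then be deployed to its node under the common name orders.ctx:

	A	B
1	for 4	=file("orders.txt").cursor@t(area,amount,odate)
2	 	=B1.select(interval(date("2021-01-01"),odate)%4==A1-1)
3	 	=file("orders"/A1/".ctx").create(area,amount,odate).append(B2)

The expression interval(…)%4 is what implements the snake rule: consecutive days rotate through the four nodes instead of being packed into consecutive quarters.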

In theory, this can cause a new kind of imbalance. For example, if we want to count only the data of day 1, 5, …, the whole task will again fall on 101. But in actual business, such a computing requirement almost never appears, and for most application scenarios snake distribution is better than sequential distribution.

An appropriate data distribution scheme should be designed according to the calculation objective, and there is no scheme that can adapt to all operations.

