Performance Optimization - 9.1 [Cluster] Computation and data distribution

 

When the amount of data is large, multiple machines can be used to share the computing tasks; such a group of machines is a cluster. The machines that take part in the computation are called nodes. There is usually also a control program that assigns computing tasks to the nodes and aggregates their results; this program is called the master computer.

A node is generally a physical machine, because a logical machine cannot actually share the computing load. The master computer is mainly responsible for control and does not undertake much computation itself, so it is usually a logical machine. A thread on one of the nodes, or an application outside the cluster, can serve as the master computer.

Multi-machine parallel computing is similar to multi-threaded computing on a single machine, except that the tasks are assigned to nodes instead of threads.

SPL’s fork statement also supports multi-machine parallel computing, provided that the esProc server has been started on each node and the access addresses have been configured in advance.

  | A                                                                   | B
1 | ["192.168.0.101:8281","192.168.0.102:8281",…, "192.168.0.104:8281"] |
2 | fork to(4);A1                                                       | =file("orders.ctx").open().cursor(area,amount)
3 |                                                                     | return B2.groups(area;sum(amount):amount)
4 | =A2.conj().groups(area;sum(amount))                                 |

Because a full year of orders is very large, we split the order table into four parts by quarter and store one part on each of four nodes, using the same file name on every node (with different contents). A1 holds the access addresses of the four nodes (below, each node is referred to by the last segment of its IP address, such as 101). The fork statement in A2 sends its code block to the esProc server on each node for execution, and each node transmits its return value back to the master computer. Once the return values of all nodes have been collected, the master computer continues with the remaining code and completes the final aggregation in A4.
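The scatter-and-gather flow described above can be sketched in Python. This is only an illustration of the idea, not how esProc implements fork: the four nodes are simulated with threads, the records are invented, and `NODE_DATA`, `node_task` and `master` are hypothetical names.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-node data: each node holds one quarter of the orders
# as (area, amount) pairs. The values are invented for illustration.
NODE_DATA = {
    101: [("East", 10.0), ("West", 5.0)],
    102: [("East", 7.0), ("North", 3.0)],
    103: [("West", 4.0), ("North", 6.0)],
    104: [("East", 1.0), ("West", 2.0)],
}

def node_task(node):
    """Runs on a node: group the local records by area and sum the amounts."""
    partial = defaultdict(float)
    for area, amount in NODE_DATA[node]:
        partial[area] += amount
    return dict(partial)

def master(nodes):
    """Runs on the master: dispatch the task, collect partial results, merge them."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(node_task, nodes))  # blocks until every node has returned
    total = defaultdict(float)
    for partial in partials:  # second-stage aggregation over the partial sums
        for area, amount in partial.items():
            total[area] += amount
    return dict(total)

result = master([101, 102, 103, 104])
```

The second stage works because a sum of partial sums equals the overall sum, which is why the master can simply group the concatenated node results by area and sum again.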

A multi-cursor can also be used on each node, so that the node itself computes in parallel:

  | A                                                                   | B
1 | ["192.168.0.101:8281","192.168.0.102:8281",…, "192.168.0.104:8281"] |
2 | fork to(4);A1                                                       | =file("orders.ctx").open().cursor@m(area,amount;;4)
3 |                                                                     | return B2.groups(area;sum(amount):amount)
4 | =A2.conj().groups(area;sum(amount))                                 |

In this way, multiple machines can be used to share computing tasks, and improve computing performance.

Generally, the nodes in a cluster have the same physical configuration and therefore roughly the same computing power. In the example above, one year’s data is divided into four quarters and placed on four nodes, which can be regarded as a fairly even split. Each node traverses almost the same amount of data, so no node is left waiting because another node is overloaded.

If the requirement changes to first filtering the order table by a specified time period and then grouping and aggregating, the situation is different.

Assume that the time period used as the filter condition falls within the first quarter, and that the data is ordered by date (which is often the case). The records that meet the condition can then be located quickly by exploiting this ordering. As a result, 102, 103 and 104 quickly find that they have nothing to do, and almost the whole task falls on 101.
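The imbalance can be made concrete with a small Python sketch. It counts, for each node, how many days of a filter period fall into the data that node holds under the by-quarter layout. The year, the filter dates and the names `node_by_quarter` and `matching_days_per_node` are assumptions chosen for illustration.

```python
from datetime import date, timedelta

NODES = [101, 102, 103, 104]

def node_by_quarter(d):
    """Sequential distribution: quarter 1 goes to 101, quarter 2 to 102, and so on."""
    return NODES[(d.month - 1) // 3]

def matching_days_per_node(start, end, assign):
    """Count, per node, the days in [start, end] whose data that node holds."""
    load = {n: 0 for n in NODES}
    d = start
    while d <= end:
        load[assign(d)] += 1
        d += timedelta(days=1)
    return load

# Hypothetical filter: 10 January to 20 February 2023, entirely inside the first quarter.
load = matching_days_per_node(date(2023, 1, 10), date(2023, 2, 20), node_by_quarter)
```

Here all 42 matching days land on 101 and the other three nodes get none, so the elapsed time is about the same as running the query on a single machine.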

This problem can be avoided by adopting the snake distribution method. Assuming the amount of data varies little from day to day, we use the remainder of the day number divided by 4 to decide which node receives the data: the data of days 1, 5, 9, … is placed on 101; the data of days 2, 6, … on 102; …; and the data of days 4, 8, … on 104. With this layout, the requirement above no longer causes a serious task imbalance.
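A minimal Python sketch of this assignment rule follows. It assumes that the "day" is the day of the year (the text does not say which day number is meant), and `node_by_snake` and `matching_days_per_node` are hypothetical names. It then counts how a filter period inside the first quarter spreads over the nodes.

```python
from datetime import date, timedelta

NODES = [101, 102, 103, 104]

def node_by_snake(d):
    """Snake distribution: days 1, 5, 9, ... go to 101; days 2, 6, ... to 102; and so on.

    Assumption: the day number is the day of the year.
    """
    day = d.timetuple().tm_yday
    return NODES[(day - 1) % len(NODES)]

def matching_days_per_node(start, end, assign):
    """Count, per node, the days in [start, end] whose data that node holds."""
    load = {n: 0 for n in NODES}
    d = start
    while d <= end:
        load[assign(d)] += 1
        d += timedelta(days=1)
    return load

# Hypothetical filter: 10 January to 20 February 2023 (42 days in the first quarter).
load = matching_days_per_node(date(2023, 1, 10), date(2023, 2, 20), node_by_snake)
```

For this period the 42 days split 10, 11, 11 and 10 across 101 to 104, so every node does roughly a quarter of the work.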

In theory, this introduces a new imbalance: for example, if we want to aggregate only the data of days 1, 5, …, the whole task falls on 101. In actual business, however, such requirements almost never arise. For most application scenarios, the snake distribution is better than the sequential distribution.

A reasonable data distribution scheme should be designed according to the calculation objective, and there is no scheme that can adapt to all operations.