Performance Optimization - 9.7 [Cluster] Multi-job load balancing
Similar to the multi-thread parallel computing on a single machine, the multi-machine parallel computing framework introduced in the first section will also wait for the slowest node to return the result before proceeding. We can try to make the amount of data to be calculated by each node more balanced, but we can’t guarantee the execution speed of each node is the same, and hence the phenomenon of waiting for the slow node by the fast node is still unavoidable. In most cases, such wait will have little impact, but it is intolerable in pursuit of ultimate performance.
Theoretically, just like the way for handling multi-thread, we can split the task into smaller pieces and dynamically allocate them to balance the load on each node, so as to reduce the wait. However, the multi-machine situation is more complicated since the dynamic load balancing can be achieved only with data redundancy. If a certain data is saved only on one node, then only this node can calculate it. In this case, other nodes have to wait even if they are faster, otherwise, the data needs to be transmitted through the network, which will not only cause network latency, but also consume the computing resources of the node where the data is located. As a result, the loss outweighs the gain. Since multiple threads on the same machine share the stored data, this problem does not exist.
The callx()function can implement this mechanism. The sub.dfx in the parameter is the script for calculating the job on the node, taking the job number as the parameter. A2 will generate 1000 jobs, only part of which will be allocated to nodes at first to make the job number of each node reach its limit, and then the node that finishes a job will be allocated another job, hereby achieving the dynamic load balancing. When a calculation error occurs because the data for a certain job is not on the allocated node, the callx() will reallocate this job to another node. If no node can execute the job, the calculation fails.
At the end of calculation, A2 will obtain a sequence composed of the return value of every sub.dfx, which is in order of job number.
There is another problem when the job amount is large. After each job is finished, the result will be returned to the master computer, resulting in too much number of returns, this situation will also bring a heavier burden to the network. Fortunately, many operations allow the node to return the results after performing a round of aggregation, in this way, the number of returns will be reduced to that of nodes.
The callx() function has another parameter that indicates the operation script for aggregating first.
Assuming that the calculation task is to sum the return values of each sub.dfx, the code of reduce.dfx is:
where, p1 and p2 are the two parameters of reduce.dfx; p2 is the return value of each job, and p1 is the current aggregation value of the node. Each time the node finishes a job, it will call the reduce.dfx so as to obtain a new aggregation value to replace the current one. After the node finishes all jobs, it will transmit the last aggregation value, as its return value, to the master computer. At this point, A2 will get a sequence composed of the return values of all nodes, in the order of the node in its parameters (i.e., A1). The p1 and p2 here are a bit like ~~ and ~ in the iteration function.
In distributed computing terminology, such aggregation action performed by node is called reduce.