Performance Optimization - 9.6 [Cluster] Spare-wheel-pattern fault tolerance

 


Loading data into memory in advance achieves much better performance than computing over external storage. When the data is too large for the memory of a single machine to hold, we can use the nodes of a cluster to load the data in segments so as to share the computational load. The multi-machine parallel computing framework and the cluster table mentioned earlier can also support in-memory computation.

However, in a cluster environment we also face the problems of fault tolerance and fault tolerance degree, and the redundancy-pattern fault tolerance scheme described in the previous section cannot be applied to data in memory.

In the redundancy-pattern fault tolerance scheme, each piece of data is stored redundantly k extra times, that is, k+1 copies in total, to achieve a fault tolerance degree of k. In other words, to achieve a fault tolerance degree of k, the storage utilization rate is only 1/(k+1). This is acceptable for external storage, because hard disks are cheap and their capacity can be expanded almost without limit. Memory, however, is much more expensive and has a hard upper limit on capacity, so such a low utilization rate is unacceptable.

To solve this problem, we can use spare-wheel-pattern fault tolerance for memory: still divide the data into n zones and load these zones into n nodes respectively, but also prepare k idle nodes as spare nodes. When a running node fails, a spare node is started immediately to load the data of the failed node; it then forms a cluster with complete data together with the other nodes and continues to provide services. After the failed node is repaired and restored to normal operation, it can serve as a new spare node. The whole process is very similar to replacing a car wheel with the spare wheel.

If more than k nodes fail at the same time, a cluster with complete data cannot be formed even after all k spare nodes are used up. At this point, we have to declare that the cluster has failed.
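The failover process described above can be sketched as a small simulation. The class and method names here are illustrative stand-ins, not esProc APIs; the sketch only models which node serves which zone, when a spare takes over, and when the cluster must be declared failed.

```python
# Illustrative sketch of spare-wheel-pattern failover (hypothetical names,
# not part of the esProc API). Data is split into n zones held by n working
# nodes; k idle spare nodes stand by to replace any node that fails.

class SpareWheelCluster:
    def __init__(self, zones, spares):
        # zone number -> node currently serving that zone
        self.serving = {z: f"node-{z}" for z in zones}
        self.spares = list(spares)          # idle spare nodes

    def on_failure(self, zone):
        """Replace the failed node serving `zone` with a spare node."""
        if not self.spares:
            # more simultaneous failures than spares: cluster has failed
            raise RuntimeError("cluster failed: no spare node left")
        spare = self.spares.pop(0)
        self.serving[zone] = spare          # spare loads the zone's data
        return spare

    def on_repair(self, node):
        """A repaired node returns to the pool as a new spare."""
        self.spares.append(node)

cluster = SpareWheelCluster(zones=[1, 2, 3, 4], spares=["spare-1"])
print(cluster.on_failure(2))                # spare-1 takes over zone 2
```

In a real cluster the `on_failure` step would also run the loading script on the spare node; here it only records the new zone-to-node assignment.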

The memory utilization rate of this scheme can be as high as n/(n+k), which is much higher than the 1/(k+1) of redundancy-pattern fault tolerance. Moreover, the amount of data loaded into memory is usually not very large, so the on-the-spot loading after a node fails does not take long, and the cluster service can be restored quickly. By contrast, if spare-wheel-pattern fault tolerance were used for external storage, the amount of data to be loaded on the spot after a node failure could be very large, leaving the cluster unable to provide services for a long time.
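The utilization figures follow from simple arithmetic: under the redundancy pattern each zone exists in k+1 total copies of which only one is useful, while under the spare-wheel pattern n of the n+k nodes hold useful data. A quick check, with illustrative values of n and k:

```python
# Memory/storage utilization of the two fault-tolerance schemes,
# for fault tolerance degree k over data split into n zones.

def spare_wheel_utilization(n, k):
    # n data-holding nodes out of n + k total nodes hold useful data
    return n / (n + k)

def redundancy_utilization(k):
    # each zone is stored in k + 1 copies; only one copy is useful
    return 1 / (k + 1)

# e.g. n = 10 zones with fault tolerance degree k = 2
print(round(spare_wheel_utilization(10, 2), 2))   # 0.83
print(round(redundancy_utilization(2), 2))        # 0.33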

Each node needs a loading action to load its data zone into memory:

	A	B
1	if z>0	>hosts(z)
2		=file("orders.ctx":z).open().memory()
3		>env(ORDERS,B2)

The parameter z is the zone number. The hosts() function registers the zone number of this node's in-memory data with the esProc server; the data is then loaded and named as a global variable of the node.
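The loading action can be mirrored in a hedged Python analogue. Here register_zone and read_zone are hypothetical stand-ins for hosts() and file("orders.ctx":z).open().memory(), and the GLOBALS dictionary plays the role of the node's global variables set by env():

```python
# Hypothetical per-node loading action (register_zone and read_zone are
# illustrative callbacks, not esProc APIs).

GLOBALS = {}  # stand-in for the node's global variables set by env()

def load_action(z, register_zone, read_zone):
    """Register zone z for this node, load its data, and publish it globally."""
    if z > 0:                                # like: if z>0
        register_zone(z)                     # like: >hosts(z)
        GLOBALS["ORDERS"] = read_zone(z)     # like: >env(ORDERS,B2)
```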

At this point, the computation code can use the loaded in-memory data:

	A
1	["192.168.0.101:8281","192.168.0.102:8281",…,"192.168.0.105:8281"]
2	=hosts(4,A1)
3	=memory(A2,ORDERS)
4	=A3.cursor().groups(area;sum(amount))

Note that there are 5 nodes in A1, and the hosts() function needs to find 4 normally running nodes (corresponding to the 4 zones) among them. If they cannot be found, it returns a signal indicating failure. Once found, it checks whether the zone numbers loaded on the 4 nodes are complete. If a zone number is missing, it executes the loading code on a node with that zone number as the parameter, so as to form a cluster with complete data, and then generates a cluster in-memory table for computation.
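The assembly logic performed by hosts() here can be sketched as follows. This is an illustrative reconstruction, not the actual esProc implementation: is_alive, loaded_zone and load_zone are hypothetical callbacks for probing whether a node is up, asking which zone it already holds in memory, and running the loading script on it.

```python
# Hedged sketch of hosts()-style cluster assembly: pick n healthy nodes from
# the candidate list, check which zones they already hold, and dispatch the
# loading action for any missing zone (callbacks are hypothetical).

def assemble_cluster(candidates, n, is_alive, loaded_zone, load_zone):
    """Return a zone -> node mapping covering zones 1..n, or None on failure."""
    healthy = [node for node in candidates if is_alive(node)]
    if len(healthy) < n:
        return None                       # not enough live nodes: give up
    healthy = healthy[:n]
    # nodes that already hold a zone in memory
    serving = {loaded_zone(node): node for node in healthy
               if loaded_zone(node) is not None}
    idle = [node for node in healthy if loaded_zone(node) is None]
    for zone in range(1, n + 1):          # fill in any missing zone numbers
        if zone not in serving:
            node = idle.pop()
            load_zone(node, zone)         # run the loading action on that node
            serving[zone] = node
    return serving
```

This mirrors the behavior of =hosts(4,A1) in the script above: with 5 candidate nodes it selects 4 live ones, and any node found without its zone in memory is told to load it before the cluster in-memory table is generated.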

