Performance Optimization - 9.6 [Cluster] Spare-wheel-pattern fault tolerance


Loading data into memory in advance gives much better performance than reading it from external storage. When the data is too large for the memory of a single machine, we can load it in segments across the nodes of a cluster, which also spreads the computation over those nodes. The multi-machine parallel computing framework and the cluster table described earlier both support in-memory operations.

However, a cluster environment also raises the problem of fault tolerance, and the redundancy-pattern fault tolerance scheme described in the previous section is not suitable for in-memory data.

In the redundancy-pattern fault tolerance scheme, we keep k extra copies of the data, that is, k-fault tolerance is obtained from k+1 copies of the data. In other words, to tolerate k failures, the storage utilization rate is only 1/(k+1). This is acceptable for external storage, because hard disks are cheap and their capacity can be expanded almost without limit. Memory, however, is much more expensive and its capacity has an upper limit, so a utilization rate of only 1/(k+1) is intolerable.

To solve this problem, we can use spare-wheel-pattern fault tolerance. The data is still divided into n zones, and each zone is loaded on one of n nodes; in addition, k idle nodes are kept ready as spare nodes. When a running node fails, one of the spare nodes is started immediately to load the data of the failed node, and it joins the remaining nodes to form a cluster with complete data that continues to provide service. Once the failed node has been repaired and is back to normal, it can serve as a spare node. The whole process is very similar to replacing a flat tire on a car with the spare wheel.

If more than k nodes fail at the same time, a cluster with complete data cannot be formed even after all k spare nodes have been used. In that case, the cluster has to be declared failed.
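The failover rule described above can be illustrated with a small Python model. This is only a sketch under stated assumptions: the class `SpareWheelCluster`, its methods, and the node names are hypothetical and are not part of esProc; "loading a zone" is reduced to reassigning ownership, and recovery after the cluster has been declared failed is not modeled.

```python
class SpareWheelCluster:
    """Toy model (hypothetical): n zones on n active nodes plus k idle spares."""

    def __init__(self, n, k):
        # zone number (1..n) -> name of the node currently holding that zone
        self.zone_owner = {z: f"node{z}" for z in range(1, n + 1)}
        # idle nodes that hold no data yet
        self.spares = [f"spare{i}" for i in range(1, k + 1)]
        self.failed = False

    def fail(self, node):
        """Mark a node as failed; a spare takes over its zone if one is left."""
        zone = next((z for z, owner in self.zone_owner.items() if owner == node), None)
        if zone is None:
            # The failed node was an idle spare: just remove it from the pool.
            if node in self.spares:
                self.spares.remove(node)
            return
        if not self.spares:
            # No spare left: complete data can no longer be assembled.
            self.zone_owner[zone] = None
            self.failed = True
            return
        # The spare "loads" the failed node's zone and joins the cluster.
        self.zone_owner[zone] = self.spares.pop(0)

    def repair(self, node):
        """A repaired node rejoins the cluster as an idle spare."""
        self.spares.append(node)


cluster = SpareWheelCluster(n=4, k=1)
cluster.fail("node2")      # spare1 takes over zone 2
cluster.repair("node2")    # node2 comes back as the new spare
cluster.fail("node3")      # node2 takes over zone 3
cluster.fail("node1")      # no spare left, so the cluster is declared failed
```

With n = 4 and k = 1, the model survives any single failure at a time, as long as the failed node is repaired before the next one fails.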

The memory utilization rate of this scheme is n/(n+k), which is much higher than the 1/(k+1) of redundancy-pattern fault tolerance. The amount of data loaded into the memory of one node is usually not very large, so the on-the-spot loading after a node failure takes little time and the cluster service can be restored quickly. Conversely, if spare-wheel-pattern fault tolerance were used for external storage, the amount of data to be prepared on the spot after a node failure could be very large, leaving the cluster unable to provide service for a long time.
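To make the comparison concrete, the two utilization rates can be computed side by side. The function names below are illustrative only, and the redundancy figure assumes the scheme keeps k+1 full copies of the data, as described earlier.

```python
from fractions import Fraction


def redundancy_utilization(k):
    """k+1 full copies of the data: only one copy's worth is useful capacity."""
    return Fraction(1, k + 1)


def spare_wheel_utilization(n, k):
    """n loaded nodes out of n+k nodes in total; the k spares sit idle."""
    return Fraction(n, n + k)


# Example: 4 data zones, tolerating 1 failure.
print(redundancy_utilization(1))       # 1/2
print(spare_wheel_utilization(4, 1))   # 4/5
```

For the same fault tolerance k, the gap widens as n grows, because the k idle spares are shared by all n zones instead of every zone being duplicated k times.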

Each node that holds in-memory data needs to run a loading routine:

    A         B
1   if z>0    >hosts(z)
2             =file("orders.ctx":z).open().memory()
3             >env(ORDERS,B2)

The parameter z is the zone number. The hosts() function registers on the esProc server the zone number of the node's in-memory data; the data is then loaded and stored as a global variable of the node.

The computing code can then use the loaded in-memory data:

    A
1   ["","",…,""]
2   =hosts(4,A1)
3   =memory(A2,ORDERS)
4   =A3.cursor().groups(area;sum(amount))

Note that A1 lists 5 nodes, and the hosts() function needs to find 4 normally running nodes (corresponding to the 4 zones) among them. If it cannot find them, it returns a signal indicating failure. If it can, it checks whether the zone numbers loaded on the 4 nodes are complete. For any missing zone number, it executes the loading code on a node with that zone number as the parameter, so that the nodes form a cluster with complete data; a cluster in-memory table is then generated for the calculation.
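The selection logic just described can be sketched in Python. This is a hypothetical model, not esProc's implementation: the function `build_cluster`, the dictionary layout of a node, the `load_zone` callback, and the use of `None` as the failure signal are all assumptions made for illustration.

```python
def build_cluster(n, nodes, load_zone):
    """Pick n running nodes that together cover zones 1..n.

    nodes:     list of dicts {"name": str, "alive": bool, "zone": int or None}
    load_zone: hypothetical callback load_zone(node, z) that runs the loading
               code on `node` with zone number z
    Returns the chosen nodes ordered by zone, or None if fewer than n nodes run.
    """
    alive = [nd for nd in nodes if nd["alive"]]
    if len(alive) < n:
        return None  # assumed failure signal: not enough running nodes

    # Keep one running node for every zone that is already loaded.
    chosen = {}
    for nd in alive:
        z = nd["zone"]
        if z is not None and 1 <= z <= n and z not in chosen:
            chosen[z] = nd

    # Running nodes not yet assigned can take over the missing zones.
    idle = [nd for nd in alive if all(nd is not c for c in chosen.values())]
    for z in range(1, n + 1):
        if z not in chosen:
            nd = idle.pop(0)
            load_zone(nd, z)   # execute the loading code with zone z
            nd["zone"] = z
            chosen[z] = nd

    return [chosen[z] for z in range(1, n + 1)]


# Five nodes for four zones; the node holding zone 2 is down.
nodes = [
    {"name": "n1", "alive": True, "zone": 1},
    {"name": "n2", "alive": False, "zone": 2},
    {"name": "n3", "alive": True, "zone": 3},
    {"name": "n4", "alive": True, "zone": 4},
    {"name": "n5", "alive": True, "zone": None},  # idle spare
]
loaded = []
cluster_nodes = build_cluster(4, nodes, lambda nd, z: loaded.append((nd["name"], z)))
print([nd["name"] for nd in cluster_nodes])  # ['n1', 'n5', 'n3', 'n4']
```

In this example the spare n5 is asked to load zone 2. Because at least n nodes are running, there are always enough unassigned nodes to cover the missing zones.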