Performance Optimization - 9.5 [Cluster] Redundancy-pattern fault tolerance

 

Performance Optimization - 9.4 [Cluster] Segmented dimension table

Fault tolerance must be considered when doing cluster computing. In the case of single-machine operation, if the machine fails, the computation also fails. In cluster computing, if only a few nodes fail, the cluster may still continue to function effectively.

To be fault-tolerant, redundancy is necessary. If we follow the situations discussed in the previous sections, that is, let each node only hold the data of one zone, then the failure of any node will cause the data to be incomplete, making it impossible to perform correct computation.

The esProc server can store the data from multiple zones at the same time. When a node fails, if all zones can be found on the remaining nodes, then the calculation can still proceed. However, the number of tasks executed by some nodes will increase, and the overall computing performance may also decrease.

We still take 4 zones as an example and set 2 zones on each node, as follows:

101: zone 1, 2; 102: zone 2, 3; 103: zone 3, 4; 104: zone 4, 1

Let’s examine the previous code:

A
1 [“192.168.0.101:8281”,“192.168.0.102:8281”,…, “192.168.0.104:8281”]
2 =file(“orders.ctx”:[1,2,3,4],A1)

If all nodes run normally, the file() function in A2 will take zone 1 on 101, zone 2 on 102,…, zone 4 on 104, forming a cluster file. If 103 fails, resulting in a failure to find zone 3 on 103, it will continue to search 104. If zone 3 is still not found, it will turn back to 101 to search, and will then find and take zone 3 on 102, and finally take zone 4 on 104. As a result, the four zones of this cluster file will be taken from 101, 102, 102, and 104 respectively, and 102 will execute the operation of two zones.

Such scheme, which uses data by multiple times to achieve fault tolerance, is called redundancy-pattern fault tolerance.

There is a problem here. We previously assumed that when each node holds only one zone, each zone in the multi-zone composite table of cluster can only correspond to a unique node. However, if there are redundant zones on a node, this correspondence is not unique. For example, we can take out four zones from 101, 102, 103 and 104 respectively, and we can also take out the same zones from 104, 101, 102 and 103. For single-table operations, the distribution of zones will not affect the calculation results. However, for multi-table association operations (such as homo-dimension association), the same zone distribution is required to continue the operation, because zone 1 on 101 and zone 1 on 104 cannot be associated.

A
1 [“192.168.0.101:8281”,“192.168.0.102:8281”,…, “192.168.0.104:8281”]
2 =file(“A.ctx”:[1,2,3,4],A1)
3 =file(“B.ctx”,A2)

The file() function can generate a new cluster file according to the known zone distribution scheme of a certain cluster file, and A3 will use the same scheme as A2. In this way, the association calculation can be performed based on the generated cluster table.

Let’s continue to examine the distribution scheme just mentioned:

101: zone 1, 2; 102: zone 2, 3; 103: zone 3, 4; 104: zone 4, 1

If both 103 and 104 fail, all zones cannot be found on 101 and 102, which makes it impossible to continue to calculate. In other words, such redundant distribution scheme only tolerates the failure of one node. A deeper analysis reveals that this scheme will continue to work when any one node fails, but will not work once any two nodes fail.

We call the fault tolerance **degree **of this distribution scheme as 1. However, for the scheme with only one zone for one node in the previous example, its fault tolerance degree is 0. For the cluster with n nodes, if each node has full data, its fault tolerance degree is n-1. When the fault tolerance degree is k (the cluster can still operate when any k nodes fail; the cluster fails when k+1 nodes fail), the data is required to be redundant by k times (that is, in addition to the original data, k more copies should be stored). Fault tolerance degree is sometimes called redundancy degree.

For the cluster with n nodes, we can simply use the **loop distribution **scheme to achieve k fault tolerance degree:

Node 1: zone 1, zone 2,…, zone k+1
Node 2: zone 2, zone 3,…, zone k+2
…
Node i: zone i, zone i+1,…, zone i+k
…
Node n: zone n, zone 1,…, zone k

where, zone m is equivalent to zone m-n when m>n.


Performance Optimization - 9.6 [Cluster] Spare-wheel-pattern fault tolerance
Performance Optimization - Preface