Fault Tolerance And Reliability

If n different events all have the same mean time m, then the mean time to the first of these events is m/n.
Theorem 1: The mean time to an event A is MT(A) = 1/P(A), where P(A) is the probability of A occurring in a given unit of time.
Theorem 2: P(A or B) = P(A) + P(B) - P(A and B)
Assuming A and B are independent:
= P(A) + P(B) - P(A) * P(B)
≈ P(A) + P(B) (if P(A) and P(B) are very small)
Theorem 3: If events A and B have mean times MT(A) and MT(B), then the mean time to the first of the two events is 1/(P(A) + P(B)) = 1/(1/MT(A) + 1/MT(B)).
Proof: If p is the probability of an event in a given unit of time, then its mean time is m = 1/p.
If there are n such events, the probability that one of them occurs is approximately n * p (for small p, by Theorem 2).
Therefore, the mean time to the first of these events is 1/(n * p) = m/n.
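The result can be checked numerically. The sketch below is only an illustration: the function name `mean_time_to_first` and the disk figures are assumptions made for the example, not part of the original notes. It applies Theorem 3 to any number of independent events.

```python
def mean_time_to_first(mean_times):
    """Mean time to the first of several independent events.

    Each event with mean time m has probability 1/m per unit of time
    (Theorem 1). For small probabilities these add up (Theorem 2),
    so the combined mean time is 1 / sum(1/m_i) (Theorem 3).
    """
    return 1.0 / sum(1.0 / m for m in mean_times)


# Hypothetical example: 4 disks, each with a mean time to failure of 10,000 hours.
disks = [10_000.0] * 4
print(mean_time_to_first(disks))  # about 2500 hours, i.e. m/n
```

With equal mean times the function reduces to m/n; with unequal mean times it gives the general form of Theorem 3.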
Fault Tolerance Strategy:

  1. Fail-vote:
    Use two or more modules and compare their outputs; the system stops if no majority of the outputs agrees. With duplication (two modules) it fails twice as often as a single module, since a failure of either module stops the pair, but it gives clean failure semantics.
  2. Fail-fast:
    Similar to fail-vote, except that the system senses which modules are available and then uses the majority of the available modules.
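The difference between the two strategies can be sketched as two small voters. This is a minimal illustration under stated assumptions, not a reference implementation: the function names are hypothetical, and a failed (unavailable) module is represented by `None`.

```python
from collections import Counter


def fail_vote(outputs):
    """Fail-vote: a strict majority of ALL modules must agree.

    `outputs` holds one entry per module; None marks a failed module
    (an assumption of this sketch). Raises if there is no majority.
    """
    live = [o for o in outputs if o is not None]
    if live:
        value, count = Counter(live).most_common(1)[0]
        if count > len(outputs) / 2:
            return value
    raise RuntimeError("no majority of all modules: stop")


def fail_fast_vote(outputs):
    """Fail-fast voting: a strict majority of the AVAILABLE modules suffices."""
    live = [o for o in outputs if o is not None]
    if live:
        value, count = Counter(live).most_common(1)[0]
        if count > len(live) / 2:
            return value
    raise RuntimeError("no majority among available modules: stop")
```

For three modules where two have failed, `fail_vote([7, None, None])` stops, while `fail_fast_vote([7, None, None])` still returns 7, because the one remaining module is a majority of the available modules.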
Improve the software reliability:
  1. Periodic transfer of data: The primary process does all the work until it fails; a second process, called the backup, then takes over from the primary and continues.
  2. Checkpoint-restart: The primary records its state on a duplexed storage module. At takeover, the backup reads the primary's state from the duplexed storage and resumes the application.
  3. Checkpoint messages: The primary sends its state changes as messages to the backup. At takeover, the backup gets its current state from the most recent checkpoint message.
  4. Persistent: The backup restarts in the null state and lets the transaction mechanism clean up all uncommitted transactions. This is the approach taken by most database systems.
  5. Highly available storage
    • Write to several storage modules.
    • Use some kind of checksum to make sure, with very high probability, that the data read is correct.
    • Disk mirroring is an example of this.
    • Shadowing is another mirroring technique which allows atomic write operations.
  6. Highly available processes
    • Process pairing
    • Transaction-based restart
    • Checkpoint restart
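The checkpoint-message approach from the list above can be sketched as a process pair. This is a simplified illustration: the class names `Primary` and `Backup` are hypothetical, both "processes" live in one Python program, and each checkpoint message carries the full state rather than only the change.

```python
class Backup:
    """Backup process: remembers only the most recent checkpoint message."""

    def __init__(self):
        self.last_checkpoint = None

    def receive_checkpoint(self, state):
        # Copy the state so later changes in the primary do not leak in.
        self.last_checkpoint = dict(state)

    def take_over(self):
        # Resume from the latest checkpoint; empty state if none arrived.
        return dict(self.last_checkpoint or {})


class Primary:
    """Primary process: does all the work and checkpoints every change."""

    def __init__(self, backup):
        self.state = {}
        self.backup = backup

    def update(self, key, value):
        self.state[key] = value
        # Send a checkpoint message after every state change.
        self.backup.receive_checkpoint(self.state)


backup = Backup()
primary = Primary(backup)
primary.update("balance", 100)
primary.update("balance", 75)

# Simulated failure of the primary: the backup resumes from the last checkpoint.
recovered = backup.take_over()
print(recovered)  # {'balance': 75}
```

A real process pair would send the messages over a network or inter-process channel and would need to detect the primary's failure; both are left out of this sketch.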
Improve the communication reliability: