"Fault tolerance" or being able to handle any type of fault in itself is a motivation for distributed systems. This is one of the most widely studied topics in the area of Distributed Systems. It has remained one of the hot areas for some obvious reasons -- If you are talking of a distributed environment of thousands of machines, it is evident that almost always, some will fail. Due to this very obvious fact, failures have become the norm rather than an exception.
A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.
A poorly designed Distributed System is counter-intuitive and worse than a non- distributed system. Leslie Lamport, known for his seminal work in this area wrote once in an e-mail –
There has been considerable debate over the years about whatconstitutes a distributed system. It would appear that the followingdefinition has been adopted at SRC:A distributed system is one in which the failure of a computer you didn'teven know existed can render your own computer unusable.The current electrical problem in the machine room is not the culprit--it just highlightsa situation that has been getting progressively worse. It seems that each new version ofthe nub makes my FF more dependent upon programs that run elsewhere.Having to wait a few seconds for a program to be swapped in is a lotless annoying than having to wait an hour or two for someone to rebootthe servers. I therefore propose a development project to make oursystem more robust. I am not proposing any particular approach(enabling stand-alone operation is just one possibility).I will begin the effort by volunteering to gather some data on theproblem. If you know of any instance of user's FF becoming inoperativethrough no fault of its own, please send me a message indicating theuser, the time, and the cause (if known).
Any behavior can be classified in a failure model if it does not comply with the designed protocol or contract.
Lets look at some of the failure models considered in dealing with fault tolerance –
Failure models can be classified into two broad categories –
Let's classify these even further
This is self explanatory and is the model that deals with crash of a node in the system.
Imagine a node running normally but just misses to send or receive messages; say due to some reasons such as full buffer, slow processing etc. This type of failure is classified into the omission failures model. These failures are challenging to address. Crash failures can be viewed as a special case of Omission failure. Systems that tolerate omission failures will be able to tolerate crash failures. The reverse is not true. An omission failure could be either due to send omission or receive omission. They could also happen due to infinite loops, improper memory management etc.
Let's say the server’s response lies outside the specified time interval. The server replies too late due to performance issues or the server is provided with data ‘too soon’ that it does not have enough buffer to hold the data.
This is one of the serious types of failures where the server’s response is simply incorrect. Either it can simply return incorrect data which can be called as “value failure” or can go into a state transition failure in case of an unexpected request; just like calling a bad function into the default case of a switch case block.
This model classifies failures (aberration) caused due to a malicious node that is controlled by some attacker and is the most serious of all. Interestingly, even though big companies such as Google, Facebook, Amazon, etc. are extremely wary about security and expect no Byzantine failures due to malicious behavior per se, they are still concerned about Byzantine failures due to a very different reason. Remember – we are now speaking of data at Google scale. Even smallest order of data corruption at a site can propagate to all the sites leading to a major overhead! Measures to eschew Byzantine failures thus needed for data at large scale. Byzantine failures were first analyzed by Pease et al. (1980) and Lamport et al. (1982).
In this case, the node is just uncooperative. Consider a P2P distributed scenario of torrent. You are one of the participant but selfish enough to download at full bandwidth but refuse to seed even a bit!
Communication channels involved can be prone to failures and drop messages transmitted via the network.
Many other failure models can be defined specific to any context. But these definitions are rarely used and hence not documented in the general sense.
Also look at Part 2: Timing models used in distributed context.