Fault-tolerant systems
Systems, predominantly computing and computer-based systems, which tolerate undesired changes in their internal structure or external environment. Such changes, generally referred to as faults, may occur at various times during the evolution of a system, beginning with its specification and proceeding through its utilization. Faults that occur during specification, design, implementation, or modification are called design faults; those occurring during utilization are referred to as operational faults. The use of fault tolerance techniques is based on the premise that a complex system, no matter how carefully designed and validated, is likely to contain residual design faults and to encounter unpreventable operational faults.
Generally, fault tolerance techniques attempt to prevent lower-level errors (caused by faults) from propagating into system failures. By using various types of structural and informational redundancy, such techniques either mask a fault (no errors are propagated to the faulty subsystem's output) or detect a fault (via an error) and then effect a recovery process that, if successful, prevents a system failure. In the case of a permanent internal fault, the recovery process usually includes some form of structural reconfiguration (for example, replacement of a faulty subsystem with a spare or use of an alternate program) that prevents the fault from causing further errors. Typically, a fault-tolerant system design will incorporate a mix of fault tolerance techniques that complement the techniques used for fault prevention. See Software engineering.
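The two approaches described above, masking and detection followed by recovery, can be illustrated with a minimal sketch. It is not drawn from any particular system: the names `masked_output`, `recover`, `acceptable`, and the toy replica functions are all hypothetical, and majority voting and an acceptance test with an alternate program are only one possible realization of each approach.

```python
import math
from collections import Counter
from typing import Callable, Sequence


def masked_output(replicas: Sequence[Callable[[int], int]], x: int) -> int:
    """Fault masking (assumed form: majority voting over redundant replicas).

    A minority of faulty replicas cannot propagate an error to the output.
    """
    results = [replica(x) for replica in replicas]
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(results) // 2:
        raise RuntimeError("no majority: too many replicas disagree")
    return value


def recover(
    primary: Callable[[int], int],
    alternate: Callable[[int], int],
    acceptable: Callable[[int, int], bool],
    x: int,
) -> int:
    """Detection and recovery (assumed form: acceptance test, then alternate).

    An error in the primary's result is detected by the acceptance test;
    recovery reconfigures to the alternate program.
    """
    try:
        result = primary(x)
        if acceptable(x, result):
            return result
    except Exception:
        pass  # an exception is treated here as a detected error
    result = alternate(x)
    if not acceptable(x, result):
        raise RuntimeError("recovery failed: the error becomes a system failure")
    return result


# Hypothetical replicas of a squaring unit; the third has a permanent fault.
def good_square(x: int) -> int:
    return x * x


def faulty_square(x: int) -> int:
    return x * x + 1


# Hypothetical integer square root programs; the primary has a design fault.
def faulty_isqrt(x: int) -> int:
    return x // 2


def is_isqrt(x: int, r: int) -> bool:
    # Acceptance test: r is the integer square root of x.
    return r * r <= x < (r + 1) * (r + 1)


masked = masked_output([good_square, good_square, faulty_square], 4)  # 16
recovered = recover(faulty_isqrt, math.isqrt, is_isqrt, 10)  # 3
```

In the first call the faulty replica is outvoted, so its error never reaches the output. In the second, the acceptance test rejects the primary's result and the alternate program supplies the value. A fuller design would also retire the faulty unit so that it causes no further errors.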