A fault may disrupt operation in a system by damaging the states of some data and processes. The focus of recovery is to restore some data or process(es) to a consistent state such that normal operation can be restored. Fault tolerance provides uninterrupted operation of a system despite faults. This chapter discusses recovery and fault tolerance techniques used in a distributed operating system. Resiliency, which is a technique for minimizing the impact of a fault, is also discussed.

Chapter 19: Recovery and Fault Tolerance
Operating Systems, by Dhananjay Dhamdhere. Copyright © 2008

Introduction
- Faults, Failures, and Recovery
- Byzantine Faults and Agreement Protocols
- Recovery
- Fault Tolerance Techniques
- Resiliency

Faults, Failures, and Recovery
- A fault may damage the state of a system
- Error: a part of the system state that is erroneous
- Failure: unexpected behavior or situation
- Recovery: for reliable operation, system is restored to a consistent state, and operation resumed
- A recovery is performed when a failure is noticed

Classes of Faults
- Fault model: properties that determine the kinds of errors/failures that might result from a fault
- Classes of faults:
  - System fault: system crash
    - Amnesia and partial amnesia faults
    - A fail-stop fault brings a system to a halt
  - Process fault
    - Byzantine faults: malicious or arbitrary actions
  - Storage fault: amnesia faults
  - Communication fault: nonamnesia faults

Overview of Recovery Techniques
- For non-Byzantine faults, recovery involves restoring system or application to a consistent state
  - Involves reexecuting some actions
- Recovery approaches are classified into:
  - Backward recovery: resetting state of entity affected by fault to a prior state and resuming its operation
    - Involves reexecution of some actions
  - Forward recovery: repairing erroneous state of a system so system can ...
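Backward recovery, as described in the overview of recovery techniques, can be illustrated with a small checkpoint-and-rollback sketch. This is only a toy model under stated assumptions, not the method used in the slides: the names Account, run_with_backward_recovery, checkpoint_every, and fault_at are hypothetical, the fault is injected artificially, and the failure is assumed to be noticed immediately.

```python
import copy


class Account:
    """Toy entity whose state can be checkpointed (hypothetical example)."""

    def __init__(self):
        self.state = {"balance": 0}
        self._saved = copy.deepcopy(self.state)

    def checkpoint(self):
        # Record the current state as a known-consistent prior state.
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        # Backward recovery: reset the state to the last checkpoint.
        self.state = copy.deepcopy(self._saved)

    def apply(self, amount):
        self.state["balance"] += amount


def run_with_backward_recovery(actions, checkpoint_every=2, fault_at=None):
    """Apply actions in order; on a failure, roll back and reexecute.

    fault_at is the index of the action at which a single fault is
    injected (an assumption made purely for illustration).
    Returns (final balance, number of reexecuted actions).
    """
    entity = Account()
    since_checkpoint = []  # actions performed after the last checkpoint
    reexecuted = 0
    i = 0
    while i < len(actions):
        try:
            if i == fault_at:
                fault_at = None  # the fault occurs only once
                entity.state["balance"] = -999999  # error: erroneous state
                raise RuntimeError("failure noticed")
            entity.apply(actions[i])
            since_checkpoint.append(actions[i])
            i += 1
            if len(since_checkpoint) == checkpoint_every:
                entity.checkpoint()
                since_checkpoint.clear()
        except RuntimeError:
            entity.rollback()
            # Reexecute the actions whose effects were lost by the rollback.
            for amount in since_checkpoint:
                entity.apply(amount)
                reexecuted += 1
    return entity.state["balance"], reexecuted


result = run_with_backward_recovery([10, 20, 30, 40, 50], fault_at=3)
```

With a checkpoint taken after every two actions, the fault at the fourth action forces one earlier action to be reexecuted after the rollback. Less frequent checkpoints reduce checkpointing overhead but increase the amount of reexecution after a failure.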