Fault Tolerant Computer Architecture-P4

For many years, most computer architects have pursued one primary goal: performance. Architects have translated the ever-increasing abundance of ever-faster transistors provided by Moore’s law into remarkable increases in performance. Recently, however, the bounty provided by Moore’s law has been accompanied by several challenges that have arisen as devices have become smaller, including a decrease in dependability due to physical faults.

CHAPTER 2 Error Detection

Error detection is the most important aspect of fault tolerance because a processor cannot tolerate a problem of which it is not aware. Even if the processor cannot recover from a detected error, the processor can still alert the user that an error has occurred and halt. Error detection thus provides, at the minimum, a measure of safety. A safe processor does not do anything incorrect. Without recovery, the processor may not be able to make forward progress, but at least it is safe. It is far preferable for a processor to do nothing than to silently fail and corrupt data.

In this chapter, as well as subsequent chapters, we divide our discussion into general concepts and domain-specific solutions. These processor domains include microprocessor cores (Section ), caches and memories (Section ), and multicore memory systems (Section ). We divide the discussion in this fashion because the issues in each domain tend to be quite distinct.

GENERAL CONCEPTS

There are some fundamental concepts in error detection that we discuss now so as to better understand the applications of these concepts to specific domains. The key to error detection is redundancy: a processor with no redundancy fundamentally cannot detect any errors. The question is not whether to use redundancy but rather what kind of redundancy should be used. The three classes of redundancy, physical (sometimes referred to as “spatial”), temporal, and information, are described in Table .
All error detection schemes use one or more of these types of redundancy, and we now discuss each in more depth.

Physical Redundancy

Physical (or “spatial”) redundancy is a commonly used approach for providing error detection. The simplest form of physical redundancy is dual modular redundancy (DMR) with a comparator, illustrated in Figure . DMR provides excellent error detection because it detects all errors except for errors due to design bugs, errors in the comparator, and unlikely combinations of .
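
As a rough illustration of DMR with a comparator, the following Python sketch runs two redundant copies of a module on the same input and flags any mismatch between their outputs. It is a software analogy under stated assumptions, not a hardware design: ordinary functions stand in for the replicated modules, and every name in it (dmr_execute, increment, increment_bit0_flipped, increment_design_bug) is hypothetical and chosen only for this example.

```python
from typing import Callable, Tuple


def dmr_execute(
    module_a: Callable[[int], int],
    module_b: Callable[[int], int],
    operand: int,
) -> Tuple[bool, int]:
    """Run two redundant modules on the same input and compare outputs.

    Returns (error_detected, output_of_module_a). All names here are
    hypothetical; the functions stand in for replicated hardware modules.
    """
    out_a = module_a(operand)
    out_b = module_b(operand)
    # The comparator: any mismatch between the replicas is a detected error.
    return (out_a != out_b, out_a)


def increment(x: int) -> int:
    """Fault-free module: computes x + 1."""
    return x + 1


def increment_bit0_flipped(x: int) -> int:
    """Same module with an injected fault that flips bit 0 of the result."""
    return (x + 1) ^ 1


def increment_design_bug(x: int) -> int:
    """Module with a design bug: adds 2 instead of 1."""
    return x + 2


if __name__ == "__main__":
    # Both replicas fault-free: outputs agree, no error is flagged.
    print(dmr_execute(increment, increment, 41))
    # A fault in one replica: outputs differ, so the comparator flags it.
    print(dmr_execute(increment, increment_bit0_flipped, 41))
    # The same design bug in both replicas: outputs agree, so the error
    # escapes detection even though the result is wrong.
    print(dmr_execute(increment_design_bug, increment_design_bug, 41))
```

The third call shows why design bugs fall outside what DMR detects: both replicas are built from the same flawed design, so they agree on the same incorrect output and the comparator sees nothing wrong. The sketch also assumes the comparison itself is fault-free, which mirrors the caveat about errors in the comparator.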