A Theory for Algorithm-Based Fault Tolerance in Array Processor Systems
Banerjee, Prithviraj
This item is only available for download by members of the University of Illinois community. Students, faculty, and staff at the U of I may log in with your NetID and password to view the item. If you are trying to access an Illinois-restricted dissertation or thesis, you can request a copy through your library's Inter-Library Loan office or purchase a copy directly from ProQuest.
Permalink
https://hdl.handle.net/2142/69305
Description
Title
A Theory for Algorithm-Based Fault Tolerance in Array Processor Systems
Author(s)
Banerjee, Prithviraj
Issue Date
1985
Department of Study
Electrical Engineering
Discipline
Electrical Engineering
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
Ph.D.
Degree Level
Dissertation
Keyword(s)
Computer Science
Abstract
The concept of algorithm-based fault-tolerance deals with system-level methods for obtaining reliable results from computations, especially when performed on array processor systems. In this scheme, algorithms have their outputs encoded in a system-level error-detecting or -correcting code. The algorithms rearrange the computations to allow an array processor system with a faulty processor to produce either a noncodeword output or the correct output. The algorithms also provide appropriate error-checking procedures which operate on the high-level encoded data.
This thesis deals with a theoretical study of the scheme of algorithm-based fault tolerance and addresses four issues. First, it deals with some design issues of specific fault-tolerant and fault-secure schemes. Algorithms are classified into broad classes called paradigms which are determined exclusively by the communication patterns of the processors. Fault-secure techniques are presented for three powerful paradigms: the multiplex, the recursive combination, and the multiplex-demultiplex paradigms.
The second part deals with the development of a model which can be used to analyze the fault-detecting and -locating capabilities of such algorithms. The model uses a broad interpretation of errors, faults and checks, which are represented as a tripartite graph. Three parameters are introduced to characterize the fault-tolerance scheme: the closure index, the masking index and the exposure index. Necessary and sufficient conditions for detecting and locating faults in processors during the actual computations are discussed. The model takes into account the data flow during a computation and can, therefore, pinpoint exactly which faulty processors can affect which data elements.
In the third part, some graph-theoretic bounds are presented on various useful characteristics in algorithm-based fault tolerance. The model is used to determine bounds on the number of data elements that a processor may affect while allowing t-fault detection or t-fault location. Using these results, some upper and lower bounds are presented on the number of checks required to achieve detection or location. Finally, in order to estimate the overhead required in this fault-tolerant scheme, some bounds are derived on the number of processors and the time required for the execution of the checks.
The last part of the thesis deals with a probabilistic study of the scheme. Expressions for the reliability of the results of the computation, and the time for the completion of the computation using a particular algorithm, are derived in terms of various parameters: the number of processors involved, the time for execution, and the fault-detecting, -locating, and -tolerating capabilities of the algorithm. Using the results of the probabilistic model, various alternative approaches of algorithm-based fault tolerance for a given problem are effectively compared.
Use this login method if you
don't
have an
@illinois.edu
email address.
(Oops, I do have one)
IDEALS migrated to a new platform on June 23, 2022. If you created
your account prior to this date, you will have to reset your password
using the forgot-password link below.