Manual and compiler assisted methods for generating fault-tolerant parallel programs
Roy-Chowdhury, Amber
This item is only available for download by members of the University of Illinois community. Students, faculty, and staff at the U of I may log in with your NetID and password to view the item. If you are trying to access an Illinois-restricted dissertation or thesis, you can request a copy through your library's Inter-Library Loan office or purchase a copy directly from ProQuest.
Permalink
https://hdl.handle.net/2142/21116
Description
Title
Manual and compiler assisted methods for generating fault-tolerant parallel programs
Author(s)
Roy-Chowdhury, Amber
Issue Date
1996
Doctoral Committee Chair(s)
Banerjee, Prithviraj
Department of Study
Electrical and Computer Engineering
Discipline
Electrical Engineering
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
Ph.D.
Degree Level
Dissertation
Keyword(s)
Engineering, Electronics and Electrical
Computer Science
Language
eng
Abstract
Algorithm-based fault-tolerance (ABFT) is an inexpensive method of incorporating fault-tolerance into existing applications. Applications are modified to operate on encoded data and produce encoded results which may then be checked for correctness. An attractive feature of the scheme is that it requires little or no modification to the underlying hardware or system software. Previous algorithm-based methods for developing reliable versions of numerical programs for general-purpose muiticomputers have mostly concerned themselves with error detection. A truly fault-tolerant algorithm, however, needs to locate errors and recover from them once they have been located. In a parallel processing environment, this corresponds to locating the faulty processors and recovering the data corrupted by the faulty processors. In this dissertation, we first present a general scheme for performing fault-location and recovery under the ABFT framework. Our fault model assumes that a faulty processor can corrupt all of the data it possesses. The fault-location scheme is an application of system-level diagnosis theory to the ABFT framework, while the fault-recovery scheme uses ideas from coding theory to maintain redundant data and uses this to recover corrupted data in the event of processor failures. Results are presented on implementations of three numerical algorithms on a distributed memory multicomputer, which demonstrate acceptably low overheads for the single- and double-fault location and recovery cases.
For a class of algorithms performing affine transformations, we automate the process of generating an error-detecting version at compile time. The compiler is used to identify loops that perform affine transformations on array elements. These loops are then checked by computing a checksum over the array elements being transformed and transforming the checksums appropriately, which typically results in much smaller overheads than checking the entire code by duplication. Portions of code in the program that are not affine transformations are checked by duplication. An existing source-to-source compiler, Parafrase-2, has been modified to take in programs written in High Performance Fortran (HPF) and output an error-detecting version of the same. Data distributions for the new arrays and checksums introduced are specified by inserting additional HPF directives in the program. The modified program can then be input to a parallelizer for distributed memory machines, such as PARADIGM, to obtain an error-detecting parallel program. We demonstrate results on three numerical programs by executing the error-detecting versions generated by our compiler on a distributed memory multicomputer.
Use this login method if you
don't
have an
@illinois.edu
email address.
(Oops, I do have one)
IDEALS migrated to a new platform on June 23, 2022. If you created
your account prior to this date, you will have to reset your password
using the forgot-password link below.