Reconfiguration and recovery in distributed memory multicomputers
Peercy, Michael Paul
This item is only available for download by members of the University of Illinois community. Students, faculty, and staff at the U of I may log in with your NetID and password to view the item. If you are trying to access an Illinois-restricted dissertation or thesis, you can request a copy through your library's Inter-Library Loan office or purchase a copy directly from ProQuest.
Permalink
https://hdl.handle.net/2142/19636
Description
Title
Reconfiguration and recovery in distributed memory multicomputers
Author(s)
Peercy, Michael Paul
Issue Date
1994
Doctoral Committee Chair(s)
Banerjee, Prithviraj
Department of Study
Electrical and Computer Engineering
Discipline
Electrical Engineering
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
Ph.D.
Degree Level
Dissertation
Keyword(s)
Engineering, Electronics and Electrical
Computer Science
Language
eng
Abstract
As the sizes of distributed memory multiprocessors increase, the likelihood of a fault removing one of the processors from the system grows as well. Such a fault removes some or all of the following: (a) communication paths, (b) processing power, (c) topological consistency, and (d) progress of the running application. In this thesis we propose solutions to handle all of these problems.
We handle the lost communication paths through table-based routing strategies. We present distributed algorithms for filling the tables, after a fault or repair, with the shortest communication paths surviving in the system. Also, we give algorithms for similarly filling broadcast tables and for guaranteeing that the routing tables are deadlock-free.
To take care of the lost processing power, we propose low-cost hardware reconfiguration schemes in which we embed spare processors throughout the multicomputer. One scheme places each spare processor alongside the normal processor on a node; the other places each spare processor along a selected link. We give results from low-level trace-driven simulation of these schemes with six applications. The results show that the overhead due to the hardware reconfiguration is very low.
We handle the problem of topological consistency with the abstraction of virtual spare processors. A faulty processor's workload is evenly divided among a number of nearby nodes, which timeslice the displaced workload along with their native tasks. We present an implementation of this software technique for static reconfiguration on the iPSC/2 hypercube. We give experimental results of our system for a number of applications.
Finally, the lost progress of the application due to fault is handled through our software technique for reconfiguration and recovery. We take advantage of the characteristics of the Actor model of parallel computation and dynamically shadow and check-point the activity of the application. We have implemented our techniques through modifications of the runtime system for the parallel language Charm running on the iPSC/2. After thoroughly discussing the theory and implementation, we give measurements of overhead due to fault tolerance for a number of applications and demonstrate continuance of the applications after injection of a fault.
We present experimental evaluations of most of the concepts proposed in the thesis, using real parallel applications from the numerical and VLSI CAD domain executing on a real distributed memory multicomputer, an Intel iPSC/2 hypercube.
Use this login method if you
don't
have an
@illinois.edu
email address.
(Oops, I do have one)
IDEALS migrated to a new platform on June 23, 2022. If you created
your account prior to this date, you will have to reset your password
using the forgot-password link below.