Reconfiguration and recovery in distributed memory multicomputers

Peercy, Michael Paul

Reconfiguration and recovery in distributed memory multicomputers

Peercy, Michael Paul

This item is only available for download by members of the University of Illinois community. Students, faculty, and staff at the U of I may log in with your NetID and password to view the item. If you are trying to access an Illinois-restricted dissertation or thesis, you can request a copy through your library's Inter-Library Loan office or purchase a copy directly from ProQuest.

Permalink

https://hdl.handle.net/2142/19636

Description

Title

Reconfiguration and recovery in distributed memory multicomputers

Author(s)

Peercy, Michael Paul

Issue Date

1994

Doctoral Committee Chair(s)

Banerjee, Prithviraj

Department of Study

Electrical and Computer Engineering

Discipline

Electrical Engineering

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Date of Ingest

2011-05-07T12:13:44Z

Keyword(s)

Engineering, Electronics and Electrical
Computer Science

Language

eng

Abstract

As the sizes of distributed memory multiprocessors increase, the likelihood of a fault removing one of the processors from the system grows as well. Such a fault removes some or all of the following: (a) communication paths, (b) processing power, (c) topological consistency, and (d) progress of the running application. In this thesis we propose solutions to handle all of these problems.
We handle the lost communication paths through table-based routing strategies. We present distributed algorithms for filling the tables, after a fault or repair, with the shortest communication paths surviving in the system. Also, we give algorithms for similarly filling broadcast tables and for guaranteeing that the routing tables are deadlock-free.
To take care of the lost processing power, we propose low-cost hardware reconfiguration schemes in which we embed spare processors throughout the multicomputer. One scheme places each spare processor alongside the normal processor on a node; the other places each spare processor along a selected link. We give results from low-level trace-driven simulation of these schemes with six applications. The results show that the overhead due to the hardware reconfiguration is very low.
We handle the problem of topological consistency with the abstraction of virtual spare processors. A faulty processor's workload is evenly divided among a number of nearby nodes, which timeslice the displaced workload along with their native tasks. We present an implementation of this software technique for static reconfiguration on the iPSC/2 hypercube. We give experimental results of our system for a number of applications.
Finally, the lost progress of the application due to fault is handled through our software technique for reconfiguration and recovery. We take advantage of the characteristics of the Actor model of parallel computation and dynamically shadow and check-point the activity of the application. We have implemented our techniques through modifications of the runtime system for the parallel language Charm running on the iPSC/2. After thoroughly discussing the theory and implementation, we give measurements of overhead due to fault tolerance for a number of applications and demonstrate continuance of the applications after injection of a fault.
We present experimental evaluations of most of the concepts proposed in the thesis, using real parallel applications from the numerical and VLSI CAD domain executing on a real distributed memory multicomputer, an Intel iPSC/2 hypercube.

Type of Resource

text

Permalink

http://hdl.handle.net/2142/19636

Copyright and License Information

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Electrical and Computer Engineering

Dissertations and Theses in Electrical and Computer Engineering

Reconfiguration and recovery in distributed memory multicomputers

Peercy, Michael Paul

Permalink

Description

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Dissertations and Theses - Electrical and Computer Engineering

Log In