Checkpoint-based forward recovery using lookahead execution and rollback validation in parallel and distributed systems
Long, Junsheng
Permalink
https://hdl.handle.net/2142/22741
Description
Title
Checkpoint-based forward recovery using lookahead execution and rollback validation in parallel and distributed systems
Author(s)
Long, Junsheng
Issue Date
1992
Doctoral Committee Chair(s)
Abraham, Jacob A.
Fuchs, W. Kent
Department of Study
Electrical and Computer Engineering
Discipline
Electrical and Computer Engineering
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
Ph.D.
Degree Level
Dissertation
Keyword(s)
Engineering, Electronics and Electrical
Computer Science
Language
eng
Abstract
This thesis studies a forward recovery strategy that combines checkpointing with optimistic execution in parallel and distributed systems. The approach runs replicated tasks on different processors to support forward recovery and compares their checkpoints to detect errors. To reduce overall redundancy, it uses less static redundancy in the common error-free case, where errors need only be detected, than the standard N-modular redundancy (NMR) scheme uses to mask errors. In the rare event of an error, it brings in extra redundancy for recovery. To reduce run-time recovery overhead, lookahead processes advance the computation speculatively, and a rollback process produces a diagnosis that identifies the correct lookahead processes without rolling back the whole system. Both analytical and experimental evaluations show that this strategy can achieve nearly error-free execution time even in the presence of faults, with lower average redundancy than NMR.
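The recovery flow described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the thesis's implementation: the names `step`, `run_interval`, and `forward_recovery` are hypothetical, a fault is modeled as a single flipped bit in one replica's checkpoint, and the lookahead and rollback validation runs are assumed to be fault-free and unreplicated.

```python
from typing import FrozenSet, Tuple


def step(state: int) -> int:
    """One checkpoint interval of a deterministic task (hypothetical workload)."""
    return (state * 31 + 7) % 1_000_003


def run_interval(state: int, faulty: bool) -> int:
    """Run one interval on one replica; an injected fault flips the lowest bit."""
    result = step(state)
    return result ^ 1 if faulty else result


def forward_recovery(
    start: int,
    intervals: int,
    faults: FrozenSet[Tuple[int, int]] = frozenset(),
) -> Tuple[int, int]:
    """Duplex execution with checkpoint comparison.

    `faults` holds (interval, replica) pairs where an error is injected.
    Returns the final state and the number of recoveries performed.
    """
    state, i, recoveries = start, 0, 0
    while i < intervals:
        a = run_interval(state, (i, 0) in faults)
        b = run_interval(state, (i, 1) in faults)
        if a == b:
            # Checkpoints agree: commit and move on.
            state, i = a, i + 1
            continue
        recoveries += 1
        # Mismatch: one lookahead run per candidate checkpoint advances
        # speculatively while a validation run re-executes the interval
        # from the last committed checkpoint (sequential here, conceptually
        # concurrent). Assumption: these runs are fault-free.
        lookahead = {a: step(a), b: step(b)}
        validated = step(state)
        if validated not in lookahead:
            raise RuntimeError("no candidate validated; full rollback needed")
        if i + 1 < intervals:
            # Keep the work of the lookahead that started from the
            # validated checkpoint; the other lookahead is discarded.
            state, i = lookahead[validated], i + 2
        else:
            state, i = validated, i + 1
    return state, recoveries
```

In this sketch the speculative interval is not lost when an error is detected, which is the sense in which recovery proceeds forward: the diagnosis only selects which lookahead result to keep.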
Using checkpoint comparison for error detection requires static checkpoint placement in user programs, so that replicas take their checkpoints at the same points in the computation. Checkpoints inserted according to the system clock are dynamic, so their placement is not reproducible. A compiler-enhanced polling mechanism that uses instruction-based time measures is therefore used to insert static checkpoints into user programs automatically. The technique has been implemented in a GNU CC compiler for Sun workstations. Experiments demonstrate that the approach provides stable checkpoint intervals and reproducible checkpoint placements, with performance overhead comparable to that of a previous compiler-assisted dynamic scheme (CATCH).
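The idea of instruction-based polling can be emulated in a few lines. This is a sketch only, assuming that polling points carry an estimated instruction cost and that a checkpoint is taken once the accumulated cost reaches a fixed interval; the class `InstructionPoller`, its `poll` method, and the cost value are hypothetical and are not taken from the GNU CC implementation.

```python
from typing import List, Tuple


class InstructionPoller:
    """Takes a checkpoint after a fixed budget of estimated instructions."""

    def __init__(self, interval: int) -> None:
        self.interval = interval      # instruction budget between checkpoints
        self.remaining = interval     # budget left before the next checkpoint
        self.checkpoints: List[Tuple[int, int]] = []

    def poll(self, cost: int, location: int, state: int) -> None:
        """Polling point: charge `cost` instructions, checkpoint if budget is spent."""
        self.remaining -= cost
        if self.remaining <= 0:
            self.checkpoints.append((location, state))
            self.remaining = self.interval


def workload(poller: InstructionPoller, n: int) -> int:
    """Hypothetical user program with one polling point per loop iteration."""
    total = 0
    for i in range(n):
        total += i * i
        # Assumption: the loop body is estimated at 4 instructions.
        poller.poll(cost=4, location=i, state=total)
    return total
```

Because the trigger depends on counted instructions rather than elapsed time, two runs of `workload` with the same input record checkpoints at the same locations, which is the property that checkpoint comparison needs.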
Obtaining a consistent recovery line is another issue in this forward recovery strategy. Checkpointing concurrent processes independently may produce an inconsistent recovery line, which causes rollback propagation. This thesis also describes an evolutionary approach that establishes a consistent recovery line with low overhead. The approach starts a checkpointing session by checkpointing each process locally and independently. During the session, these local checkpoints may be updated, and the updates drive the recovery line to evolve into a consistent one. Unlike the globally synchronized approach, the evolutionary approach requires no synchronization protocol to reach a consistent state for checkpointing. Unlike the communication-synchronized approach, it avoids excessive checkpointing by providing controllable checkpoint placement. Unlike loosely synchronized schemes, it requires neither message retry nor message replay during recovery.
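The consistency problem itself can be made concrete with a short check. This sketch illustrates only the standard condition that a recovery line is inconsistent when some message is recorded as received but not as sent; it does not reproduce the evolutionary algorithm, whose details the abstract does not give. The names `Message`, `orphan_messages`, and `is_consistent`, and the use of per-process event counters as times, are assumptions made for illustration.

```python
from typing import Dict, List, NamedTuple


class Message(NamedTuple):
    sender: str
    send_time: int    # sender's local event counter at send
    receiver: str
    recv_time: int    # receiver's local event counter at receipt


def orphan_messages(
    checkpoints: Dict[str, int], messages: List[Message]
) -> List[Message]:
    """Messages received before the receiver's checkpoint but sent after
    the sender's checkpoint; each one makes the recovery line inconsistent."""
    return [
        m
        for m in messages
        if m.send_time > checkpoints[m.sender]
        and m.recv_time <= checkpoints[m.receiver]
    ]


def is_consistent(checkpoints: Dict[str, int], messages: List[Message]) -> bool:
    """A recovery line is consistent when it has no orphan messages."""
    return not orphan_messages(checkpoints, messages)
```

For example, with independent checkpoints at `{"P": 5, "Q": 8}` and a message sent by `P` at time 6 and received by `Q` at time 7, the line is inconsistent; moving `P`'s checkpoint to time 6 removes the orphan.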