Scalable message-logging techniques for effective fault tolerance in HPC applications
Meneses Rojas, Esteban
Permalink
https://hdl.handle.net/2142/45447
Description
- Title
- Scalable message-logging techniques for effective fault tolerance in HPC applications
- Author(s)
- Meneses Rojas, Esteban
- Issue Date
- 2013-08-22T16:40:26Z
- Director of Research (if dissertation) or Advisor (if thesis)
- Kale, Laxmikant V.
- Doctoral Committee Chair(s)
- Kale, Laxmikant V.
- Committee Member(s)
- Cappello, Franck
- Heath, Michael T.
- Vaidya, Nitin H.
- Bronevetsky, Greg
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Message-logging
- Fault Tolerance
- High Performance Computing (HPC)
- Resilience
- Abstract
- An important set of challenges emerges as the High Performance Computing (HPC) community aims to reach extreme scale. Resilience and energy consumption are two of those challenges. Extreme-scale machines are expected to have a high failure frequency. This is an inevitable consequence of the mismatch between two trends: the number of components assembled in supercomputers grows exponentially, while the reliability of each individual component improves much more slowly. At the same time, the vast number of components in a single machine will consume a non-trivial amount of energy. To keep a supercomputer within operational margins, HPC systems have to be both reliable and energy-aware. For an application to run and make progress in spite of constant interruptions, it has to incorporate some form of fault tolerance. Rollback-recovery techniques provide a framework to overcome crashes in the system by periodically saving the state of the application and rolling back to a checkpoint in case of failure. Two well-known rollback-recovery techniques are checkpoint/restart and message-logging. The former is easier to implement and has become the de facto standard for making applications fault tolerant. It has, however, a high performance and energy cost during recovery. Message-logging, on the other hand, makes it possible to recover faster from a failure and to consume less energy. The downside of message-logging is the overhead it exhibits in the failure-free scenario: its memory and performance overheads may offset its advantages. This thesis focuses on techniques to alleviate the downsides of message-logging. It presents a mechanism based on high-level programming language constructs to decrease the performance overhead of message-logging. It also introduces two strategies to reduce the memory overhead created by the message log. Additionally, it addresses important architectural constraints of modern supercomputers.
Based on large-scale experimental results and projections from an analytical model, we conclude that message-logging is a promising strategy to provide fault tolerance at a low energy cost for extreme-scale machines.
- Graduation Semester
- 2013-08
- Permalink
- http://hdl.handle.net/2142/45447
- Copyright and License Information
- Copyright 2013 Esteban Meneses Rojas
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois
Dissertations and Theses - Computer Science
Dissertations and Theses from the Dept. of Computer Science