Withdraw
Loading…
Mitigation of failures in high performance computing via runtime techniques
Ni, Xiang
Loading…
Permalink
https://hdl.handle.net/2142/92798
Description
- Title
- Mitigation of failures in high performance computing via runtime techniques
- Author(s)
- Ni, Xiang
- Issue Date
- 2016-07-12
- Director of Research (if dissertation) or Advisor (if thesis)
- Kalé, Laxmikant V
- Doctoral Committee Chair(s)
- Kalé, Laxmikant V
- Committee Member(s)
- Vaidya, Nitin
- Kramer, William
- Cappello, Franck
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- runtime system
- fault tolerance
- silent data corruption
- hard error
- soft error
- solid state disk
- high performance computing
- Abstract
- As machines increase in scale, it is predicted that failure rates of supercomputers will correspondingly increase. Even though the mean time to failure (MTTF) of individual component is high, the large number of components significantly decreases the system MTTF. Meanwhile, the decreasing size of transistors has been critical to the increase in capacity of supercomputers. The smaller the transistors are, silent data corruptions (SDC) are likely to occur more frequently. SDCs do not inhibit execution, but may silently lead to incorrect results. In this thesis, we leverage runtime system and compiler techniques to mitigate a significant fraction of failures automatically with low overhead. The main goals of various system-level fault tolerance strategies designed in this thesis are: reducing the extra cost added to application execution while improving system reliability; automatically adjusting fault tolerance decisions without user intervention based on environmental changes; protecting applications not only from fail-stop failures but also from silent data corruptions. The main contributions of this thesis are development of a semi-blocking checkpoint protocol that overlaps application execution with fault tolerance operation to reduce the overhead of checkpointing, a runtime system technique for automatic checkpoint and restart without user intervention, a holistic framework (ACR) for automatically detecting and recovering from silent data corruptions and a framework called FlipBack that provides targeted protection against silent data corruption with low cost.
- Graduation Semester
- 2016-08
- Type of Resource
- text
- Permalink
- http://hdl.handle.net/2142/92798
- Copyright and License Information
- Copyright 2016 Xiang Ni
Owning Collections
Dissertations and Theses - Computer Science
Dissertations and Theses from the Dept. of Computer ScienceGraduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at IllinoisManage Files
Loading…
Edit Collection Membership
Loading…
Edit Metadata
Loading…
Edit Properties
Loading…
Embargoes
Loading…