From experiment to design – fault characterization and detection in parallel computer systems using computational accelerators
Yim, Keun Soo
Permalink
https://hdl.handle.net/2142/44390
Description
- Title
- From experiment to design – fault characterization and detection in parallel computer systems using computational accelerators
- Author(s)
- Yim, Keun Soo
- Issue Date
- 2013-05-24T22:10:01Z
- Director of Research (if dissertation) or Advisor (if thesis)
- Iyer, Ravishankar K.
- Doctoral Committee Chair(s)
- Iyer, Ravishankar K.
- Committee Member(s)
- Sha, Lui R.
- Campbell, Roy H.
- Abdelzaher, Tarek F.
- Chen, Shuo
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Fault tolerance system design
- Experimental validation
- Error detection
- Fault injection
- Measurement-based co-design
- Graphics Processing Unit fault tolerance
- Message Passing Interface
- CPU-GPU hybrid computers
- COTS-based mission-critical systems
- Reliability
- Dependability
- Abstract
- This dissertation summarizes experimental validation and co-design studies conducted to optimize the fault detection capabilities and overheads of hybrid computer systems (e.g., systems using CPUs and Graphics Processing Units, or GPUs), and consequently to improve the scalability of parallel computer systems that use computational accelerators. The experimental validation studies were conducted to understand the failure characteristics of CPU-GPU hybrid computer systems under various types of hardware faults. The main characterization targets were faults that are difficult to detect and/or recover from, e.g., faults that cause long-latency failures (Ch. 3), faults in dynamically allocated resources (Ch. 4), faults in GPUs (Ch. 5), faults in MPI programs (Ch. 6), and microarchitecture-level faults with specific timing features (Ch. 7). The co-design studies were based on the characterization results. One of the co-designed systems has a set of source-to-source translators that customize and strategically place error detectors in the source code of target GPU programs (Ch. 5). Another co-designed system uses an extension card to learn the normal behavioral and semantic execution patterns of message-passing processes executing on CPUs, and to detect abnormal behaviors of those parallel processes (Ch. 6). The third co-designed system is a co-processor that provides a set of new instructions to support software-implemented fault detection techniques (Ch. 7). The work described in this dissertation is increasingly important because heterogeneous processors have become an essential component of state-of-the-art supercomputers. GPUs were used in three of the five fastest supercomputers that were operating in 2011. Our work included comprehensive fault characterization studies in CPU-GPU hybrid computers. In CPUs, we monitored the target systems for a long period of time after injecting faults (a temporally comprehensive experiment), and injected faults into various types of program states, including dynamically allocated memory (a spatially comprehensive experiment). In GPUs, we used fault injection studies to demonstrate the importance of detecting silent data corruption (SDC) errors, which are mainly due to the lack of fine-grained protections and the massive use of fault-insensitive data. This dissertation also presents transparent fault tolerance frameworks and techniques that are directly applicable to hybrid computers built using only commercial off-the-shelf hardware components. This dissertation shows that by developing an understanding of the failure characteristics and error propagation paths of target programs, we were able to create fault tolerance frameworks and techniques that can quickly detect and recover from hardware faults with low performance and hardware overheads.
- Graduation Semester
- 2013-05
- Permalink
- http://hdl.handle.net/2142/44390
- Copyright and License Information
- Copyright 2013 Keun Soo Yim
Owning Collections
Dissertations and Theses - Computer Science
Dissertations and Theses from the Dept. of Computer Science
Graduate Dissertations and Theses at Illinois (primary)
Graduate Theses and Dissertations at Illinois