Resiliency of high-performance computing systems: A fault-injection-based characterization of the high-speed network in the Blue Waters testbed

Tang, Sharon S.

Content Files

TANG-THESIS-2018.pdf

Permalink

https://hdl.handle.net/2142/102953

Description

Title

Resiliency of high-performance computing systems: A fault-injection-based characterization of the high-speed network in the Blue Waters testbed

Author(s)

Tang, Sharon S.

Issue Date

2018-12-11

Director of Research (if dissertation) or Advisor (if thesis)

Kalbarczyk, Zbigniew T.

Department of Study

Electrical & Computer Eng

Discipline

Electrical & Computer Engr

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

M.S.

Degree Level

Thesis

Date of Ingest

2019-02-08T18:44:42Z

Keyword(s)

Resiliency
Reliability
Fault Injections
High-Performance Computing
Interconnects

Abstract

Supercomputers have played an essential role in the progress of science and engineering research. As the high-performance computing (HPC) community moves towards the next generation of HPC computing, it faces several challenges, one of which is reliability of HPC systems. Error rates are expected to significantly increase on exascale systems to the point where traditional application-level checkpointing may no longer be a viable fault tolerance mechanism. This poses serious ramifications for a system's ability to guarantee reliability and availability of its resources. It is becoming increasingly important to understand fault-to-failure propagation and to identify key areas of instrumentation in HPC systems for avoidance, detection, diagnosis, mitigation, and recovery of faults. This thesis presents a software-implemented, prototype-based fault injection tool called HPCArrow and a fault injection methodology as a means to investigate and evaluate HPC application and system resiliency. We demonstrate HPCArrow's capabilities through four fault injection campaigns on a Cray XE/XK hybrid testbed, covering single injections, time-varying or delayed injections, and injections during recovery. These injections emulate failures on network and compute components. The results of these campaigns provide insight into application-level and system-level resiliencies. Across various HPC application frameworks, there are notable deficiencies in fault tolerance. Our experiments also revealed a failure phenomenon that was previously unobserved in field data: application hangs, in which forward progress is not made, but jobs are not terminated until the maximum allowed time has elapsed. At the system level, failover procedures prove highly robust on small-scale systems, able to handle both single and multiple faults in the network.

Graduation Semester

2018-12

Type of Resource

text

Owning Collections

Dissertations and Theses - Electrical and Computer Engineering

Dissertations and Theses in Electrical and Computer Engineering

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois
