Battling Failures
Gainaru, Ana
Permalink
https://hdl.handle.net/2142/49203
Description
- Title
- Battling Failures
- Author(s)
- Gainaru, Ana
- Issue Date
- 2014-05
- Keyword(s)
- Computer Science
- Abstract
- A large percentage of the computing capacity in today's large high-performance computing systems is wasted due to failures and recoveries. The fear in our community is that future Exascale systems will fail so frequently that no useful work will be possible. My research focuses on characterizing the events generated at the hardware, system, or application level by understanding the complex correlations between different system components. This information is used to predict failures and, as a consequence, to minimize or prevent their effects on running applications. The image represents an overview of the overall analysis process: monitoring applications and their performance, modeling the system and the way anomalies propagate between components, analyzing the current state, diagnosing errors, and predicting failures. Today's supercomputers are too large and complex for anyone to manually inspect or visualize all the events that occur during an application's execution. Tools like this, which adapt and learn as the system experiences new events, allow applications to take preventive actions that increase their efficiency and, as a consequence, complete their tasks even on future Exascale machines. Credits: Images provided by the National Center for Supercomputing Applications Visualization Laboratory.
- Type of Resource
- text
- image
- Permalink
- http://hdl.handle.net/2142/49203
- Copyright and License Information
- Copyright 2014 Ana Gainaru