Understanding and improving availability of reliable distributed storage systems
Cui, Shengkun
Permalink
https://hdl.handle.net/2142/104004
Description
Contributor(s)
Jha, Saurabh
Kalbarczyk, Zbigniew
Issue Date
2019-05
Keyword(s)
Distributed File System; Distributed Storage System; Lustre File System; Blue Waters; Failure Characterization; Data Analysis; Machine Learning; Long Short-term Memory; Recurrent Neural Network; System Probing; Failure Detection; Failure Prediction
Abstract
From genomic sequencing to weather forecasting, high-performance computing (HPC) systems
have profound impacts on scientific breakthroughs and people’s everyday lives. Failures in an HPC
environment can cause partial or system-wide outages that degrade application performance and
waste computational resources. Recent studies on the availability and reliability of HPC
systems have shown that storage system failures are one of the major limiting factors for achieving high
system utility. However, there is limited understanding of storage system failures, their propagation,
and their impact on application performance.
Using statistical analysis and machine learning techniques, we characterize I/O failures in a
distributed storage system and their impact on applications. The target is the storage
system of Blue Waters, a petascale supercomputer at the University of Illinois at Urbana-Champaign
that runs the Lustre file system. Driven by the characterization results, we use a Long Short-Term Memory
(LSTM) network, a type of recurrent neural network, to support runtime detection and localization of failures at
per-storage-server granularity.
In this thesis, we present an overview of the project, the Blue Waters storage system architecture and
specifications, background on the Lustre file system, a failure characterization of the Blue Waters storage system
based on NCSA maintenance logs, storage server logs, and Quality of Service (QoS)
measurements, and the machine learning models for runtime failure detection. We also include key
algorithms for data cleaning, processing, and analysis, and a performance evaluation of the machine learning
model for runtime failure detection. Furthermore, we present an extension of our study: applying the model
we developed to failure prediction.