Understanding and improving availability of reliable distributed storage systems

Cui, Shengkun

Permalink

https://hdl.handle.net/2142/104004

Description

Title

Understanding and improving availability of reliable distributed storage systems

Author(s)

Cui, Shengkun

Contributor(s)

Jha, Saurabh
Kalbarczyk, Zbigniew

Issue Date

2019-05

Keyword(s)

Distributed File System; Distributed Storage System; Lustre File System; Blue Waters; Failure Characterization; Data Analysis; Machine Learning; Long Short-term Memory; Recurrent Neural Network; System Probing; Failure Detection; Failure Prediction

Abstract

From genomic sequencing to weather forecasting, high-performance computing (HPC) systems have a profound impact on scientific breakthroughs and people’s everyday lives. Failures in an HPC environment can cause partial or system-wide outages that degrade application performance and waste computational resources. Recent studies on the availability and reliability of HPC systems have shown that storage system failures are one of the major factors limiting system utility. However, there is limited understanding of storage system failures, their propagation, and their impact on application performance. Using statistical analysis and machine learning techniques, we characterize I/O failures in a distributed storage system and their impact on applications. The target is the storage system of Blue Waters, a petascale supercomputer at the University of Illinois at Urbana-Champaign that runs the Lustre file system. Driven by the characterization results, we use a Long Short-Term Memory (LSTM) network, a type of recurrent neural network, to support runtime detection and localization of failures at per-storage-server granularity. In this thesis, we present an overview of the project; the Blue Waters storage system architecture and specifications; background on the Lustre file system; a failure characterization of the Blue Waters storage system based on NCSA maintenance logs, storage server logs, and Quality of Service (QoS) measurements; and the machine learning models for runtime failure detection. We also include the key algorithms for data cleaning, processing, and analysis, and a performance evaluation of the machine learning model for runtime failure detection. Finally, we present an extension of our study: using the model we developed for failure prediction.

Type of Resource

text

Owning Collections

Senior Theses - Electrical and Computer Engineering PRIMARY

The best of ECE undergraduate research
