Understanding the fault-tolerance properties of large-scale storage systems

Rozier, Eric

Permalink

https://hdl.handle.net/2142/29680

Description

Title

Understanding the fault-tolerance properties of large-scale storage systems

Author(s)

Rozier, Eric

Issue Date

2012-02-06T20:10:40Z

Director of Research (if dissertation) or Advisor (if thesis)

Sanders, William H.

Committee Member(s)

Agha, Gul A.
Levinson, Stephen E.
Viswanathan, Mahesh
Zhou, Pin

Department of Study

Computer Science

Discipline

Computer Science

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Date of Ingest

2012-02-06T20:10:40Z

Keyword(s)

storage systems
modeling
simulation
rare-events
fault-tolerance

Abstract

Modern storage systems continue to grow in scale and complexity as they attempt to meet the increasing storage needs of our society. Additionally, stricter requirements to comply with government regulation and consumer expectations have increased the need to make data more available and reliable for longer periods of time. The design of modern and next-generation storage systems is a difficult task that requires high storage capacity and efficiency while also maintaining data integrity. The rapid advancement of storage system technologies brings with it a level of uncertainty as to the fitness of new designs and methods for meeting these complex requirements. New technologies, like deduplication, promise improved storage efficiency, but their impact on reliability measures is unclear due to the complex relationships inherent to the systems that employ these technologies. Additionally, as systems scale up, they become subject to faults and errors that previous-generation systems may never have encountered due to the rare nature of these faults. Because of the stiffness of the represented systems and the complex relationships involved, it can be difficult to analyze these environments correctly and efficiently. In this dissertation, we propose a method to analyze storage system reliability by using component-based models coupled with realistic fault models. We solve these complex systems by identifying fault, fault propagation, and mitigation events; by identifying dependence relationships between state variables, events, and rewards; and by decomposing our model at various points during model solution to improve the efficiency of our solution while maintaining the correctness of our reward measures. In particular, we discuss building scalable component-based models of large-scale systems that employ modern reliability methods, such as RAID, and state-of-the-art storage efficiency methods, such as deduplication.
We present detailed fault models for these systems, including a novel model for undetected disk errors. To enable efficient solution of these models, we propose a method to analyze the dependence relationships that underlie storage systems and a way to solve the models by identifying and exploiting these relationships when solving for reliability measures. We apply our methods to real-world systems, detail the consequences of deduplication for reliability, and suggest and evaluate methods to improve reliability while still maintaining improved storage efficiency.

Graduation Semester

2011-12

Owning Collections

Dissertations and Theses - Computer Science

Dissertations and Theses from the Siebel School of Computer Science

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois
