Failure diagnosis in distributed systems
Seo, Eunsoo
Permalink
https://hdl.handle.net/2142/34467
Description
- Title
- Failure diagnosis in distributed systems
- Author(s)
- Seo, Eunsoo
- Issue Date
- 2012-09-18T21:18:33Z
- Director of Research (if dissertation) or Advisor (if thesis)
- Abdelzaher, Tarek F.
- Doctoral Committee Chair(s)
- Abdelzaher, Tarek F.
- Committee Member(s)
- Han, Jiawei
- Vaidya, Nitin H.
- Ko, Steven
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Debugging
- Bug Diagnosis
- Concurrency Bugs
- Error Propagation
- Abstract
- Failures in computing systems are unavoidable, so detecting and diagnosing them early is important for improving system reliability. This dissertation introduces new approaches to root-cause diagnosis for two notorious types of failures in distributed systems. The first focus is on failures caused by software bugs that are triggered by race conditions. Because they manifest non-deterministically, these bugs are much harder to diagnose, fix, and test than bugs in sequential logic. To understand concurrency bugs, we first study their characteristics using 105 bugs from four representative open-source programs. Motivated by the findings of that study, we propose an automatic bug diagnosis tool for distributed programs that finds the minimal causal orders of related events that trigger the bugs. Our tool is a significant extension of previous tools, which can find only a bug-triggering sequence of events. The second focus of this dissertation is on failures caused by propagating errors. An error that starts at a single network component propagates and contaminates other components, so that a large number of network components become infected. To fix the problem, its root cause, the single component that started the error propagation, needs to be identified. It is assumed that only a limited view of the status of components -- whether they are infected or not -- is available through monitors, a set of pre-selected network components. For this problem, we propose two root-cause diagnosis tools. The first tool relies on the simple intuition that the root-cause component is likely to be close to the infected monitors and far from the uninfected monitors. We also compare six different monitor selection methods.
The second tool makes use of additional information -- failure propagation probabilities and times of infection -- to improve the accuracy of root-cause diagnosis. We propose approximation algorithms to calculate the likelihood that a node is the failure source. We also propose a new monitor selection algorithm that maximizes the number of infected monitors for the best root-cause diagnosis accuracy.
- Graduation Semester
- 2012-08
- Permalink
- http://hdl.handle.net/2142/34467
- Copyright and License Information
- Copyright 2012 Eun Soo Seo
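The abstract's intuition for the first propagation-diagnosis tool (the root cause is likely close to infected monitors and far from uninfected ones) can be illustrated with a small sketch. This is not the dissertation's algorithm: the scoring rule (sum of hop distances to uninfected monitors minus sum of hop distances to infected monitors), the undirected unweighted graph, the function names, and the penalty for unreachable monitors are all assumptions made for illustration.

```python
from collections import deque


def bfs_distances(graph, source):
    """Hop distance from source to every reachable node (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def rank_root_cause_candidates(graph, infected_monitors, uninfected_monitors):
    """Rank nodes from most to least likely root cause under the assumed score.

    Assumed score: (total distance to uninfected monitors)
                   - (total distance to infected monitors).
    A higher score means closer to infected and farther from uninfected monitors.
    """
    unreachable = len(graph)  # assumed penalty distance for unreachable monitors
    scores = {}
    for node in graph:
        dist = bfs_distances(graph, node)
        near = sum(dist.get(m, unreachable) for m in infected_monitors)
        far = sum(dist.get(m, unreachable) for m in uninfected_monitors)
        scores[node] = far - near
    # Sort by descending score; break ties by node name for determinism.
    return sorted(scores, key=lambda n: (-scores[n], n))


# Hypothetical example: a path network a - b - c - d - e.
graph = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
ranking = rank_root_cause_candidates(
    graph, infected_monitors={"a", "b"}, uninfected_monitors={"e"}
)
print(ranking)  # → ['a', 'b', 'c', 'd', 'e']
```

In this toy network, node a ranks first because it sits next to both infected monitors and is farthest from the uninfected one. The second tool described in the abstract additionally uses propagation probabilities and infection times, which this sketch does not model.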
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois
Dissertations and Theses - Computer Science
Dissertations and Theses from the Dept. of Computer Science