Withdraw
Loading…
Systems, models and algorithms for failure diagnosis in networked infrastructure
Vipul Harsh, -
Loading…
Permalink
https://hdl.handle.net/2142/124409
Description
- Title
- Systems, models and algorithms for failure diagnosis in networked infrastructure
- Author(s)
- Vipul Harsh, -
- Issue Date
- 2024-04-26
- Director of Research (if dissertation) or Advisor (if thesis)
- Godfrey, Philip Brighten
- Doctoral Committee Chair(s)
- Godfrey, Philip Brighten
- Committee Member(s)
- Chekuri, Chandra Sekhar
- Mittal, Radhika
- Banerjee, Sujata
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- network troubleshooting
- performance diagnosis
- fault localization
- gray failure localization
- debugging distributed systems
- root cause analysis
- probabilistic graphical models
- bayesian networks
- markov random fields
- flock
- murphy
- faultference
- Abstract
- Failure incidents in networked systems such as datacenters, private or public cloud environments, WAN, and enterprise networks lead to significant downtime and violations of service-level agreements, incurring large financial losses. To mitigate failures quickly, operators seek to implement automated failure diagnosis, also known as Root Cause Analysis (RCA). The goal of RCA is to localize faults with high accuracy in a timely manner using practical monitoring telemetry. There are two major challenges in designing such RCA solutions: (1) accurately modeling the behavior of the system and (2) using the model to infer the root causes accurately. Existing works on RCA either (1) fall short in modeling the complexities of an environment and hence utilize imprecise models or (2) utilize an inference algorithm for RCA that is inadequate in extracting high accuracy from the model. In this thesis, we propose solutions for RCA based on two high-level ideas. First, we utilize a suitably designed Probabilistic Graphical Model (PGM) to accurately model the system behavior. A PGM is a model based on an underlying graph structure that, as we show in the thesis, naturally allows us to model the complexities of a distributed system with many components. Second, we design custom training and inference algorithms to extract maximum inference accuracy from the PGM based model. Our algorithms are designed to handle the various characteristics of real-world systems such as large-scale or noisy input monitoring data. Using the high-level ideas outlined above, we propose three RCA systems. In Chapter 2, we describe Murphy, an RCA system for diagnosing performance issues in distributed applications. Murphy models weak inter-dependencies between system components via a Markov Random Field (a type of PGM based on undirected graphs) that can handle cyclic dependencies. With its more precise modeling of inter dependencies, it is able to predict root causes with higher accuracy than past works. In Chapter 3, we describe Flock which localizes unreported failures in datacenter networks. Flock models the problem via a Bayesian network (a type of PGM based on directed acyclic graphs) that can handle the various complexities of a datacenter network such as lack of knowledge of path information of flows or occasional noisy packet drops. Flock utilizes a custom inference algorithm that can speed up inference in discrete-valued PGMs by multiple orders of magnitude. This inference algorithm allows Flock to unlock the benefits of PGM-based modeling for datacenter networks and allows it to achieve higher inference accuracy than past works. In Chapter 4, we describe FaultFerence, a RCA system for detecting faults in datacenter networks with only passive monitoring that obscures path information up to a large set of paths, for all flows. In such cases, even Flock (or any fault localization algorithm) can only localize the fault to a large equivalent set of devices. To localize within this equivalence set, FaultFerence uses a similar model as Flock’s but additionally runs iterations of inference followed by strategically chosen “microactions” that tweak the network ever so slightly which helps in symmetry breaking. Using this approach, FaultFerence can reduce the time to localize faults, while also reducing the invasiveness of the localization procedure compared to ad-hoc techniques employed today. The above systems demonstrate the benefits of principled inference via accurate modeling of system behavior towards achieving high diagnosis accuracy.
- Graduation Semester
- 2024-05
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2024 Vipul Harsh
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at IllinoisManage Files
Loading…
Edit Collection Membership
Loading…
Edit Metadata
Loading…
Edit Properties
Loading…
Embargoes
Loading…