Systems, models and algorithms for failure diagnosis in networked infrastructure

Vipul Harsh, -

Systems, models and algorithms for failure diagnosis in networked infrastructure

Vipul Harsh, -

Permalink

https://hdl.handle.net/2142/124409

Description

Title

Systems, models and algorithms for failure diagnosis in networked infrastructure

Author(s)

Vipul Harsh, -

Issue Date

2024-04-26

Director of Research (if dissertation) or Advisor (if thesis)

Godfrey, Philip Brighten

Doctoral Committee Chair(s)

Godfrey, Philip Brighten

Committee Member(s)

Chekuri, Chandra Sekhar
Mittal, Radhika
Banerjee, Sujata

Department of Study

Computer Science

Discipline

Computer Science

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Keyword(s)

network troubleshooting
performance diagnosis
fault localization
gray failure localization
debugging distributed systems
root cause analysis
probabilistic graphical models
bayesian networks
markov random fields
flock
murphy
faultference

Abstract

Failure incidents in networked systems such as datacenters, private or public cloud environments, WAN, and enterprise networks lead to significant downtime and violations of service-level agreements, incurring large financial losses. To mitigate failures quickly, operators seek to implement automated failure diagnosis, also known as Root Cause Analysis (RCA). The goal of RCA is to localize faults with high accuracy in a timely manner using practical monitoring telemetry. There are two major challenges in designing such RCA solutions: (1) accurately modeling the behavior of the system and (2) using the model to infer the root causes accurately. Existing works on RCA either (1) fall short in modeling the complexities of an environment and hence utilize imprecise models or (2) utilize an inference algorithm for RCA that is inadequate in extracting high accuracy from the model. In this thesis, we propose solutions for RCA based on two high-level ideas. First, we utilize a suitably designed Probabilistic Graphical Model (PGM) to accurately model the system behavior. A PGM is a model based on an underlying graph structure that, as we show in the thesis, naturally allows us to model the complexities of a distributed system with many components. Second, we design custom training and inference algorithms to extract maximum inference accuracy from the PGM based model. Our algorithms are designed to handle the various characteristics of real-world systems such as large-scale or noisy input monitoring data. Using the high-level ideas outlined above, we propose three RCA systems. In Chapter 2, we describe Murphy, an RCA system for diagnosing performance issues in distributed applications. Murphy models weak inter-dependencies between system components via a Markov Random Field (a type of PGM based on undirected graphs) that can handle cyclic dependencies. With its more precise modeling of inter dependencies, it is able to predict root causes with higher accuracy than past works. In Chapter 3, we describe Flock which localizes unreported failures in datacenter networks. Flock models the problem via a Bayesian network (a type of PGM based on directed acyclic graphs) that can handle the various complexities of a datacenter network such as lack of knowledge of path information of flows or occasional noisy packet drops. Flock utilizes a custom inference algorithm that can speed up inference in discrete-valued PGMs by multiple orders of magnitude. This inference algorithm allows Flock to unlock the benefits of PGM-based modeling for datacenter networks and allows it to achieve higher inference accuracy than past works. In Chapter 4, we describe FaultFerence, a RCA system for detecting faults in datacenter networks with only passive monitoring that obscures path information up to a large set of paths, for all flows. In such cases, even Flock (or any fault localization algorithm) can only localize the fault to a large equivalent set of devices. To localize within this equivalence set, FaultFerence uses a similar model as Flock’s but additionally runs iterations of inference followed by strategically chosen “microactions” that tweak the network ever so slightly which helps in symmetry breaking. Using this approach, FaultFerence can reduce the time to localize faults, while also reducing the invasiveness of the localization procedure compared to ad-hoc techniques employed today. The above systems demonstrate the benefits of principled inference via accurate modeling of system behavior towards achieving high diagnosis accuracy.

Graduation Semester

2024-05

Type of Resource

Thesis

Handle URL

https://hdl.handle.net/2142/124409

Copyright and License Information

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois

Systems, models and algorithms for failure diagnosis in networked infrastructure

Vipul Harsh, -

Permalink

Description

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Log In