Improving reliability and security monitoring in enterprise and cloud systems by leveraging information redundancy

Thakore, Uttam

Permalink

https://hdl.handle.net/2142/109623

Description

Title

Improving reliability and security monitoring in enterprise and cloud systems by leveraging information redundancy

Author(s)

Thakore, Uttam

Issue Date

2020-12-03

Director of Research (if dissertation) or Advisor (if thesis)

Sanders, William H

Doctoral Committee Chair(s)

Sanders, William H

Committee Member(s)

Gupta, Indranil
Nahrstedt, Klara
Ranchal, Rohit
Ramasamy, Harigovind V

Department of Study

Computer Science

Discipline

Computer Science

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Keyword(s)

monitoring
reliability
security
compliance audit
cloud computing
incident detection
incident response

Abstract

As computing has become critical to all areas of modern life, the need to ensure the security and reliability of the underlying information technology infrastructures is greater than ever before. Large-scale enterprise and cloud systems, which form the backbone for the majority of computing activity, consist of many components and services interacting in complex and sometimes unpredictable ways. As such systems have grown in scale and complexity, they have become increasingly difficult to protect against security and reliability incidents, resulting in recent years in ever more frequent service disruptions, failures, and data breaches with massive financial and societal implications. System owners have a strong desire to prevent such incidents.
Incident detection and response and compliance audit are the two primary mechanisms by which organizations enforce reliability and security policies and make their systems more resilient. Both the academic and professional communities have focused considerable attention on developing techniques to improve incident detection, incident root cause analysis, and compliance auditing, often with little consideration for the cost of the monitoring that is required to support them. Furthermore, as the scale and complexity of systems have increased, so too have the scale and complexity of their monitoring infrastructures. Monitors can fail or be compromised, and monitor data must be selectively collected to avoid exceeding storage and processing limits. Consequently, it has become increasingly important to explicitly consider the efficiency, efficacy, and resiliency of monitoring systems when designing large-scale enterprise and cloud systems.
In this dissertation, we address inefficiencies and inadequacies in reliability and security monitoring in enterprise and cloud systems by leveraging redundancy of information across diverse monitors. In particular, we use the redundancy of data generated by different monitors 1) to facilitate more effective and efficient use of the data in meeting reliability and security objectives, and 2) to improve the resiliency of the monitoring infrastructure itself against failures and attacks.
First, we present a framework for simplifying the complexity of data analysis for incident response in enterprise cloud systems. As a foundation for the framework, we define a general taxonomy for fields within monitor data that administrators can use to label both structured and unstructured components of data. We then present a method to automatically extract time series features based on labels from our taxonomy, remove uninformative features, and reduce the overall number of features by clustering together related and redundant features. We apply our framework to logs and metrics collected during reliability incidents from all levels of an experimental platform-as-a-service cloud at a large computing organization, and demonstrate that our approach enables efficient coordinated analysis of both metric data and log data. Such analysis typically presents a challenge to cloud support engineers, but can identify meaningful relationships between features that can aid in incident response.
Next, we present a systematic methodology that enables system administrators to design monitoring systems that are resilient to missing data. We develop a model-based approach to quantify the resilience of a system's monitoring and incident detection infrastructure design against missing data, and we use it to develop a method to find monitor deployments that maximize resilience subject to monitoring cost constraints. We illustrate how our approach can be applied to production systems by using a datacenter network case study model based on monitors employed in production systems, and we evaluate its scalability by using randomly generated models of varying sizes and structures. We compare our approach to the current state of the art and demonstrate that our approach consistently finds monitor deployments that are more resilient under the same constraints.
Finally, we address the inefficiencies faced by a cloud service provider (CSP) during audit evidence collection as a result of a poor understanding of evidence requirements. We motivate our analysis by developing a taxonomic framework for understanding the causes of and potential solutions to uncertainty in audit. We present a model-driven method to learn evidence sufficiency requirements directly from historical audit records. We then apply our cost-optimal resilient monitoring approach to the evidence sufficiency model to determine an efficient evidence collection strategy for the CSP. We apply our approach to the historical audit records from an enterprise infrastructure-as-a-service cloud system at a large computing organization and demonstrate how use of our approach could have enabled more efficient evidence collection.
We believe that our work clearly demonstrates the need to critically examine the resiliency and efficiency of monitoring infrastructures in enterprise and cloud systems. This dissertation presents solutions to specific challenges faced by practitioners when monitoring their systems for reliability and security objectives, but our work addresses only part of the larger problem space of resilient monitoring system design. We hope that this dissertation paves the way for future research that focuses on the resilience of the monitoring infrastructure itself.

Graduation Semester

2020-12

Type of Resource

Thesis

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Computer Science

Dissertations and Theses from the Dept. of Computer Science
