Troubleshooting interactive complexity bugs
Khan, Mohammad M.
Permalink
https://hdl.handle.net/2142/29503
Description
- Title
- Troubleshooting interactive complexity bugs
- Author(s)
- Khan, Mohammad M.
- Issue Date
- 2012-02-01T00:49:40Z
- Director of Research (if dissertation) or Advisor (if thesis)
- Abdelzaher, Tarek F.
- Doctoral Committee Chair(s)
- Abdelzaher, Tarek F.
- Committee Member(s)
- Han, Jiawei
- Sha, Lui R.
- Liu, Jie
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- interactive complexity bugs
- discriminative sequence mining
- troubleshooting
- Abstract
- The term “interactive complexity” was introduced by Charles Perrow in his famous book Normal Accidents: Living with High-Risk Technologies [1]. He used the term to describe the tendency toward interaction in systems with a large number of components. He argued that, in such systems, multiple failures often interact in unexpected ways, leading to catastrophic failures in systems such as planes or nuclear power plants. He also suggested that with increasing interactive complexity and tight coupling, unexpected interactions of failures are bound to happen. Indeed, with the proliferation of cheap Internet-enabled embedded devices with built-in sensors and actuators (e.g., smart phones, smart appliances), the physical world is increasingly becoming an integral part of the logical world of computation. As computing systems become much more interactive and responsive to their surrounding physical environments, it is becoming increasingly difficult to test such systems to their full extent before deployment in the real world. Hence, due to increased interactive complexity and tight coupling between the physical and logical worlds, such systems often fail or perform poorly once deployed in real life. Unintended interactions among various system components, or across computing systems and physical environments, are often to blame. With this growing trend, the bugs that arise from interactions among distributed components across multiple nodes are likely to get worse and to affect system reliability significantly. This calls for new tools and techniques to troubleshoot future software systems. In this dissertation, we address this significant challenge of troubleshooting interactive complexity bugs in emerging cyber-physical systems using data mining techniques.
- More specifically, we applied a discriminative sequence mining algorithm to system logs to isolate chains of events (not necessarily contiguous) that are causally correlated with failure. In the first part of our thesis, using our tool, we successfully identified multiple bugs in various real systems, such as a multi-channel MAC (medium access control) layer protocol for wireless sensor networks [2], a kernel-level race condition bug in the LiteOS operating system, and a corner-case design flaw in the directed diffusion protocol [3]. Next, we extended our approach to identify “symbolic” patterns, where absolute values are replaced with abstract symbols whenever appropriate, to identify more subtle patterns across multiple system logs. We then examined the applicability of our approach to troubleshooting harmful interactive complexity that may arise from poor integration of adaptive components in server clusters. More specifically, we extended our approach to identify “cyclic” patterns in data center applications, which potentially highlight self-reinforcing loops. Finally, to complement our work on troubleshooting interactive complexity, we address the challenge of diagnosing occasional “lack of interaction” in deployed systems. Such “lack of interaction” is often caused by unresponsive nodes. We develop the tele-diagnostic powertracer, an in-situ troubleshooting tool that uses external power measurements to determine the internal health condition of an unresponsive host and the most likely cause of its failure. Using our tool, we successfully distinguish between several categories of failures that cause unresponsive behavior, including energy depletion, antenna damage, radio disconnection, system crashes, and anomalous reboots. To the best of our knowledge, we are the first to present a diagnostic tool that uses power measurements to diagnose sensor system failures remotely.
- Graduation Semester
- 2011-12
- Permalink
- http://hdl.handle.net/2142/29503
- Copyright and License Information
- Copyright 2011 Mohammad Maifi Hasan Khan.
Owning Collections
Dissertations and Theses - Computer Science
Dissertations and Theses from the Dept. of Computer Science
Graduate Dissertations and Theses at Illinois (PRIMARY)
Graduate Theses and Dissertations at Illinois