Providing application-aware reliability through OS/hypervisor-level techniques
Wang, Long
Permalink
https://hdl.handle.net/2142/18440
Description
- Title
- Providing application-aware reliability through OS/hypervisor-level techniques
- Author(s)
- Wang, Long
- Issue Date
- 2011-01-14T22:50:58Z
- Director of Research (if dissertation) or Advisor (if thesis)
- Iyer, Ravishankar K.
- Doctoral Committee Chair(s)
- Iyer, Ravishankar K.
- Committee Member(s)
- Lumetta, Steven S.
- Parthasarathy, Madhusudan
- Vasudevan, Shobha
- Department of Study
- Electrical & Computer Eng
- Discipline
- Electrical & Computer Engr
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- checkpoint
- reliability
- virtual machine
- hypervisor
- system hang
- microkernel
- operating system
- error detection
- error injection
- Abstract
- Operating systems and hypervisors enable the collection and extraction of rich information on application and system execution characteristics. This thesis describes a Reliability MicroKernel (RMK) architecture, which provides an infrastructure that enables the design and deployment of software modules for providing application-aware error detection and recovery. The purpose of the RMK is to provide an automatic approach for low-latency crash/hang detection and rapid recovery via checkpoint. We first demonstrate how the RMK works in a native system and then enhance the RMK to work in VMs. In a native system, the RMK is installed as a device driver, while in a virtualized system, the RMK is both installed as a device driver in VMs and deployed as a hypercall (which is like a system call) in a hypervisor. Our approach is transparent to applications and VMs, i.e., it is not required to modify or recompile the kernel source code in a native system or in a VM. The implemented RMK modules include OS/application crash detection, system hang detection, and transparent checkpoint. Traditionally, an external hardware watchdog is used to force a system reboot whenever the watchdog is not reset within a predefined timeout interval. The detection latency might be significant because the timeout interval for resetting the watchdog timer is usually a matter of seconds to reduce false alarms. The approach in this thesis enables low-latency OS-hang detection (within hundreds of milliseconds or less) by measuring the count of instructions executed between two consecutive context switches and checking if the count exceeds a predefined threshold value. The RMK is enhanced to support virtualized environments. Specifically, we present the description, implementation, and experimental assessment of VM-μCheckpoint, a VM checkpointing framework to protect both the guest OS and applications against runtime errors. 
Compared with the existing VM checkpoint techniques, our VM-μCheckpoint has small overhead and rapid recovery, handles non-fail-stop errors, and runs at high frequency (tens of checkpoints per second) to reduce the recomputation necessary when recovering a VM from a failure. The key point of VM-μCheckpoint is that we do an incremental checkpoint by considering the whole memory of the protected VM as part of the checkpoint. The RMK prototype has been implemented in both Linux and Windows systems on a Pentium 4 processor and is also implemented in the Xen VMM. (The Xen hypervisor is recompiled for installing RMK, but the OS of a native system or a VM is not recompiled.) Error injection experiments show that our RMK detects all the crashes and system hangs, and VM-μCheckpoint successfully recovers VMs from all the crashes. Moreover, the experimental evaluation of the RMK using real-world applications shows that we achieve high coverage and low false-positive rates for error detection (e.g., no false positives for system hang detection) as well as low overhead in providing checkpoint and recovery (e.g., an average of 6.3% overhead in VM-μCheckpoint for SPEC benchmark programs with 50 ms checkpoint intervals). We also apply a formal method and analytical/probabilistic models to verify the capability of our system hang detection and to study the availability enhancement provided by the RMK.
- Graduation Semester
- 2010-12
- Permalink
- http://hdl.handle.net/2142/18440
- Copyright and License Information
- Copyright 2010 Long Wang
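The hang-detection idea summarized in the abstract (count the instructions executed between two consecutive context switches and compare the count against a threshold) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the dissertation's implementation: the class name `HangDetector`, its methods, and the threshold value are hypothetical, and the instruction counts are plain integers standing in for whatever counter the RMK actually reads.

```python
from dataclasses import dataclass


@dataclass
class HangDetector:
    """Toy model of instruction-count-based hang detection (hypothetical names)."""

    threshold: int              # assumed limit on instructions between context switches
    last_switch_count: int = 0  # counter reading at the most recent context switch

    def on_context_switch(self, instr_count: int) -> None:
        # A context switch shows the scheduler is still making progress,
        # so the measurement window restarts at the current counter reading.
        self.last_switch_count = instr_count

    def is_hung(self, instr_count: int) -> bool:
        # Flag a hang when more instructions than the threshold have executed
        # since the last observed context switch.
        return instr_count - self.last_switch_count > self.threshold


# Illustrative values only; the threshold is an assumption, not a figure from the thesis.
detector = HangDetector(threshold=1_000_000)
detector.on_context_switch(instr_count=200_000)
healthy = detector.is_hung(instr_count=900_000)    # 700,000 instructions since the switch
hung = detector.is_hung(instr_count=1_500_000)     # 1,300,000 instructions since the switch
```

In this toy model the threshold trades detection latency against false alarms: a smaller value flags a hang sooner but risks misreporting long-running code that legitimately executes many instructions without a context switch.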
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois
Dissertations and Theses - Electrical and Computer Engineering
Dissertations and Theses in Electrical and Computer Engineering