FastRecover: simple and effective fault recovery in a distributed operator-based stream processing engine
Yaduvanshi, Shashank
Permalink
https://hdl.handle.net/2142/90832
Description
- Title
- FastRecover: simple and effective fault recovery in a distributed operator-based stream processing engine
- Author(s)
- Yaduvanshi, Shashank
- Issue Date
- 2016-04-27
- Director of Research (if dissertation) or Advisor (if thesis)
- Winslett, Marianne
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Fault recovery
- Stateful operators
- Abstract
- Fault tolerance is a key requirement in large-scale distributed stream processing engines (SPEs), especially those that run atop commodity hardware. Currently, fault tolerance in popular distributed SPEs is either inadequate (e.g., those without automatic recovery of operator states) or complex and inefficient (e.g., those with transactional semantics). There are two major considerations in the design of an effective fault tolerance mechanism: the overhead of additional checkpointing operations during normal processing, and the time required to recover and return to normal processing when a failure happens. The main challenge is that faster recovery requires higher checkpointing overhead, and vice versa. This thesis presents FastRecover, a novel fault tolerance mechanism for distributed SPEs that strikes a balance between recovery time and checkpointing overhead. Specifically, given an application topology consisting of interconnected operators, and an upper bound on checkpoint overhead, FastRecover computes the optimal expected recovery time, as well as the strategy used for checkpointing and recovery in each operator. The main idea of FastRecover is to compute an optimal partitioning of the streaming operator topology into independent segments; for each segment, FastRecover backs up its input tuples and periodically checkpoints the states of the operators therein. During recovery for a particular segment, FastRecover restores each affected operator state in the segment to the latest checkpoint, and replays the inputs of the segment since then. Both checkpointing and recovery utilize the parallel processing capabilities of the distributed SPE. Extensive experiments demonstrate that FastRecover achieves an average 50% reduction in expected recovery time compared to simple solutions. The experiments also show that the total expected recovery time varies in proportion to the total computational recovery time and recovery latency in tests with simulated failures, and hence is a good measure to optimize.
- Graduation Semester
- 2016-05
- Type of Resource
- text
- Permalink
- http://hdl.handle.net/2142/90832
- Copyright and License Information
- Copyright 2016 Shashank Yaduvanshi
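The checkpoint-and-replay recovery described in the abstract can be illustrated with a minimal single-process sketch. This is not the thesis's implementation: the `Segment` class, the word-count operator, the `checkpoint_every` parameter, and the `run_demo` helper are all hypothetical names chosen for illustration, and the sketch leaves out the topology partitioning and parallelism that FastRecover relies on. It only shows the per-segment idea of backing up inputs, periodically checkpointing operator state, and recovering by restoring the latest checkpoint and replaying the inputs received since then.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Hypothetical segment: one stateful word-count operator plus an input backup."""
    checkpoint_every: int = 3                       # assumed interval, in tuples
    state: dict = field(default_factory=dict)       # live operator state
    input_log: list = field(default_factory=list)   # inputs since the last checkpoint
    checkpoint: dict = field(default_factory=dict)  # last saved operator state

    def _apply(self, word):
        # The operator logic: count occurrences of each word.
        self.state[word] = self.state.get(word, 0) + 1

    def process(self, word):
        self.input_log.append(word)  # back up the input tuple
        self._apply(word)
        if len(self.input_log) >= self.checkpoint_every:
            # Save the state; inputs already covered by it can be dropped.
            self.checkpoint = deepcopy(self.state)
            self.input_log.clear()

    def crash(self):
        # Simulate a failure that loses the in-memory operator state.
        self.state = {}

    def recover(self):
        # Restore the latest checkpoint, then replay inputs logged since then.
        self.state = deepcopy(self.checkpoint)
        for word in self.input_log:
            self._apply(word)


def run_demo():
    seg = Segment(checkpoint_every=3)
    for w in ["a", "b", "a", "c", "a"]:
        seg.process(w)
    expected = dict(seg.state)  # state just before the simulated failure
    seg.crash()
    seg.recover()
    return expected, seg.state
```

In this sketch, a smaller `checkpoint_every` means more frequent state copies during normal processing but fewer tuples to replay after a failure, which mirrors the trade-off between checkpointing overhead and recovery time that the abstract identifies.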
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois