FastRecover: simple and effective fault recovery in a distributed operator-based stream processing engine

Yaduvanshi, Shashank

Permalink

https://hdl.handle.net/2142/90832

Description

Title

FastRecover: simple and effective fault recovery in a distributed operator-based stream processing engine

Author(s)

Yaduvanshi, Shashank

Issue Date

2016-04-27

Director of Research (if dissertation) or Advisor (if thesis)

Winslett, Marianne

Department of Study

Computer Science

Discipline

Computer Science

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

M.S.

Degree Level

Thesis

Keyword(s)

Fault recovery
Stateful operators

Abstract

Fault tolerance is a key requirement in large-scale distributed stream processing engines (SPEs), especially those that run atop commodity hardware. Currently, fault tolerance in popular distributed SPEs is either inadequate (e.g., those without automatic recovery of operator states) or complex and inefficient (e.g., those with transactional semantics). There are two major considerations in the design of an effective fault tolerance mechanism: the overhead of additional checkpointing operations during normal processing, and the time required to recover and return to normal processing when a failure happens. The main challenge is that faster recovery requires higher checkpointing overhead, and vice versa. This thesis presents FastRecover, a novel fault tolerance mechanism for distributed SPEs that strikes a balance between recovery time and checkpointing overhead. Specifically, given an application topology consisting of interconnected operators and an upper bound on checkpointing overhead, FastRecover computes the optimal expected recovery time, as well as the strategy used for checkpointing and recovery in each operator. The main idea of FastRecover is to compute an optimal partitioning of the streaming operator topology into independent segments; for each segment, FastRecover backs up its input tuples and periodically checkpoints the states of the operators therein. During recovery of a particular segment, FastRecover restores each affected operator state in the segment to the latest checkpoint and replays the inputs the segment has received since then. Both checkpointing and recovery utilize the parallel processing capabilities of the distributed SPE. Extensive experiments demonstrate that FastRecover reduces expected recovery time by 50% on average compared to simple solutions. The experiments also show that, in tests with simulated failures, the total expected recovery time varies in proportion to the total computational recovery time and recovery latency, and hence is a good measure to optimize.
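
The per-segment checkpoint-and-replay idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the class names (`CountingOperator`, `Segment`), the count-based checkpoint interval, and the in-memory input log are all assumptions made for illustration, and the sketch runs on a single process and ignores the partitioning optimization, parallelism, and downstream duplicate handling.

```python
import copy


class CountingOperator:
    """Hypothetical stateful operator: counts how often each tuple is seen."""

    def __init__(self):
        self.state = {}

    def process(self, tup):
        self.state[tup] = self.state.get(tup, 0) + 1
        return tup


class Segment:
    """Hypothetical segment: a chain of operators with one backed-up input log."""

    def __init__(self, operators, checkpoint_interval):
        self.operators = operators
        self.checkpoint_interval = checkpoint_interval  # assumed: tuples per checkpoint
        self.input_log = []  # input tuples received since the latest checkpoint
        self.checkpoint = [copy.deepcopy(op.state) for op in operators]

    def _run(self, tup):
        # Push one tuple through every operator in the segment, in order.
        for op in self.operators:
            tup = op.process(tup)
        return tup

    def process(self, tup):
        # Back up the input before processing so it can be replayed later.
        self.input_log.append(tup)
        out = self._run(tup)
        if len(self.input_log) >= self.checkpoint_interval:
            self.take_checkpoint()
        return out

    def take_checkpoint(self):
        # Save all operator states; the logged inputs are no longer needed.
        self.checkpoint = [copy.deepcopy(op.state) for op in self.operators]
        self.input_log.clear()

    def recover(self):
        # Restore each operator to the latest checkpoint, then replay the log.
        for op, saved in zip(self.operators, self.checkpoint):
            op.state = copy.deepcopy(saved)
        for tup in self.input_log:
            self._run(tup)
        return len(self.input_log)  # number of tuples replayed


segment = Segment([CountingOperator()], checkpoint_interval=3)
for t in ["a", "b", "a", "c", "a"]:
    segment.process(t)
state_before_failure = dict(segment.operators[0].state)

segment.operators[0].state = {}  # simulate losing the operator state
replayed = segment.recover()
```

In this sketch a shorter checkpoint interval means fewer tuples to replay after a failure but more frequent state copies during normal processing, which mirrors the trade-off between recovery time and checkpointing overhead that the abstract describes.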

Graduation Semester

2016-05

Type of Resource

text

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois
