Exploration of fault tolerance in Apache Spark
Gupta, Akshun
Permalink
https://hdl.handle.net/2142/99383
Description
- Title
- Exploration of fault tolerance in Apache Spark
- Author(s)
- Gupta, Akshun
- Issue Date
- 2017-12-06
- Director of Research (if dissertation) or Advisor (if thesis)
- Gupta, Indranil
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Apache Spark
- Fault tolerance
- Abstract
- This thesis explores two techniques for providing fault tolerance for batch processing in Apache Spark and evaluates the benefits and challenges of each approach. Apache Spark is a cluster computing system composed of three main components: the driver program, the cluster manager, and the worker nodes. Spark already tolerates the loss of worker nodes, and external tools provide fault tolerance for the cluster manager; for example, deploying the cluster manager with Apache Mesos makes it fault tolerant. Spark does not, however, support driver fault tolerance for batch processing. The driver program stores critical state of the running job by maintaining oversight of the workers; failure of the driver program results in the loss of all oversight over the worker nodes and is equivalent to catastrophic failure of the entire Spark application. In this thesis, we explore two approaches to achieving fault tolerance in Apache Spark for batch processing, enabling guaranteed execution of long-running critical jobs and consistent performance while maintaining high uptime. The first approach serializes the critical state of the driver program and relays that state to passive processors; upon failure, this state is loaded by a secondary processor and computation is resumed. The second approach narrows the scope of the problem and synchronizes block information between primary and secondary drivers so that the locations of cached aggregated data are not lost after primary driver failure; loss of these locations leads to a state from which computation cannot be resumed. Both approaches propose considerable changes to the Apache Spark architecture in order to support high availability of batch processing jobs.
- Graduation Semester
- 2017-12
- Type of Resource
- text
- Copyright and License Information
- Copyright 2017 Akshun Gupta
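
The first approach described in the abstract, serializing the driver's critical state and relaying it to a passive processor that resumes computation after a failure, can be illustrated with a minimal sketch. This is not Spark code: the `PrimaryDriver` and `StandbyDriver` classes, the tracked `completed_stages` list, and the snapshot/relay protocol are all simplified assumptions made for illustration only.

```python
import pickle

class PrimaryDriver:
    """Hypothetical active driver that tracks critical job state."""

    def __init__(self):
        self.completed_stages = []

    def run_stage(self, stage_id):
        # Record progress so a standby could resume from here.
        self.completed_stages.append(stage_id)

    def snapshot(self):
        # Serialize critical state for relay to a passive processor.
        return pickle.dumps({"completed_stages": self.completed_stages})

class StandbyDriver:
    """Hypothetical passive processor holding the latest relayed state."""

    def __init__(self):
        self.latest_blob = None

    def receive(self, blob):
        self.latest_blob = blob

    def take_over(self):
        # On primary failure, load the serialized state and resume
        # from the last completed stage instead of restarting the job.
        state = pickle.loads(self.latest_blob)
        return state["completed_stages"]

primary = PrimaryDriver()
standby = StandbyDriver()
for stage in [0, 1, 2]:
    primary.run_stage(stage)
    standby.receive(primary.snapshot())  # relay state after each stage

# Simulated primary failure: the standby resumes with the relayed state.
resumed = standby.take_over()
print(resumed)  # [0, 1, 2]
```

In the real system a relay like this would have to ship state over the network and decide how frequently to snapshot, trading checkpoint overhead against the amount of work lost on failover.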
Owning Collections
Graduate Dissertations and Theses at Illinois (PRIMARY)
Dissertations and Theses - Computer Science