Exploration of fault tolerance in Apache Spark
Gupta, Akshun
Permalink
https://hdl.handle.net/2142/99383
Description
- Title
- Exploration of fault tolerance in Apache Spark
- Author(s)
- Gupta, Akshun
- Issue Date
- 2017-12-06
- Director of Research (if dissertation) or Advisor (if thesis)
- Gupta, Indranil
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Apache Spark
- Fault tolerance
- Abstract
- This thesis explores two techniques for providing fault tolerance for batch processing in Apache Spark and evaluates the benefits and challenges of each approach. Apache Spark is a cluster computing system composed of three main components: the driver program, the cluster manager, and the worker nodes. Spark already tolerates the loss of worker nodes, and external tools provide fault tolerance for the cluster manager; for example, deploying the cluster manager with Apache Mesos makes it fault tolerant. Spark does not, however, support driver fault tolerance for batch processing. The driver program stores critical state of the running job by maintaining oversight of the workers; failure of the driver program results in the loss of all oversight over the worker nodes and is equivalent to catastrophic failure of the entire Spark application. In this thesis, we explore two approaches to achieving fault tolerance in Apache Spark for batch processing, enabling guaranteed execution of long-running critical jobs and consistent performance while maintaining high uptime. The first approach serializes the critical state of the driver program and relays that state to passive processors; upon failure, this state is loaded by a secondary processor and computation is resumed. The second approach narrows the scope of the problem and synchronizes block information between primary and secondary drivers so that the locations of cached aggregated data are not lost after primary driver failure; loss of these locations leads to a state from which computation cannot be resumed. Both approaches propose considerable changes to the Apache Spark architecture in order to support high availability of batch processing jobs.
- Graduation Semester
- 2017-12
- Type of Resource
- text
- Copyright and License Information
- Copyright 2017 Akshun Gupta
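
The first approach described in the abstract, serializing the driver's critical state and relaying it to a passive processor that resumes computation after a failure, can be illustrated with a minimal sketch. This is not Spark code: the `PrimaryDriver` and `StandbyDriver` classes, the tracked `completed_stages` list, and the snapshot/relay protocol are all simplified assumptions made for illustration only.

```python
import pickle

class PrimaryDriver:
    """Hypothetical active driver that tracks critical job state."""

    def __init__(self):
        self.completed_stages = []

    def run_stage(self, stage_id):
        # Record progress so a standby could resume from here.
        self.completed_stages.append(stage_id)

    def snapshot(self):
        # Serialize critical state for relay to a passive processor.
        return pickle.dumps({"completed_stages": self.completed_stages})

class StandbyDriver:
    """Hypothetical passive processor holding the latest relayed state."""

    def __init__(self):
        self.latest_blob = None

    def receive(self, blob):
        self.latest_blob = blob

    def take_over(self):
        # On primary failure, load the serialized state and resume
        # from the last completed stage instead of restarting the job.
        state = pickle.loads(self.latest_blob)
        return state["completed_stages"]

primary = PrimaryDriver()
standby = StandbyDriver()
for stage in [0, 1, 2]:
    primary.run_stage(stage)
    standby.receive(primary.snapshot())  # relay state after each stage

# Simulated primary failure: the standby resumes with the relayed state.
resumed = standby.take_over()
print(resumed)  # [0, 1, 2]
```

In the real system a relay like this would have to ship state over the network and decide how frequently to snapshot, trading checkpoint overhead against the amount of work lost on failover.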
Owning Collections
Graduate Dissertations and Theses at Illinois (PRIMARY)
Dissertations and Theses - Computer Science