Approximate failure recovery in distributed graph processing systems

Leslie, Luke M.

Approximate failure recovery in distributed graph processing systems

Leslie, Luke M.

Content Files

LESLIE-THESIS-2016.pdf

Permalink

https://hdl.handle.net/2142/90630

Description

Title

Approximate failure recovery in distributed graph processing systems

Author(s)

Leslie, Luke M.

Issue Date

2016-04-25

Director of Research (if dissertation) or Advisor (if thesis)

Gupta, Indranil

Department of Study

Computer Science

Discipline

Computer Science

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

M.S.

Degree Level

Thesis

Date of Ingest

2016-07-07T19:57:49Z

Keyword(s)

Failure recovery
distributed graph processing
approximate

Abstract

Distributed graph processing systems are an emerging area of big data systems. As graphs continue to grow in size and prevalence, these systems must become faster and more scalable. However, after failures, distributed graph processing systems either largely rely on proactive fault tolerance techniques such as checkpointing, or use no fault tolerance mechanisms at all and simply restart computation. The former approach entails significant proactive overheads that increase with the size of the graph, while the latter wastes time and resources in potentially lengthy recomputation. In this thesis, we argue that distributed graph processing systems should instead use a approximate approach to failure recovery that trades off minimal amounts of application accuracy while reducing the overhead during failure-free execution to zero, and allowing fast and scalable recovery. We build a system called Zorro that imbues the approximate reactive approach, and integrate Zorro into two distributed graph processing systems -- PowerGraph and LFGraph. When a failure occurs, Zorro opportunistically exploits vertex replication (inherent in today's graph processing systems) to quickly and scalably rebuild the state of failed servers. In addition, we describe three other novel failure recovery mechanisms that aim to address several of Zorro's shortcomings. The first utilizes optimistic accuracy results from graph sampling and hence continues after failure without taking any action. The second repartitions the graph after failure to avoid waiting for replacement servers, and then continues computation with the recovered state. The last allows a small amount of proactive overhead to significantly increase the fraction of recovered state. Experiments using five real-world graphs and eight benchmark applications demonstrate that Zorro is able to recover over 99% of the graph state when a few servers fail, and between 87-92% when half the cluster fails, with recovery taking only a fraction of the cost of a single iteration. Furthermore, using eight common graph processing algorithms, Zorro incurs little to no accuracy loss in all experimental failure scenarios. Furthermore, preliminary analysis and experiments using our three alternative approaches suggest that they are able to address many of the potential issues Zorro faces with minimal overhead and accuracy loss.

Graduation Semester

2016-05

Type of Resource

text

Permalink

http://hdl.handle.net/2142/90630

Copyright and License Information

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois

Approximate failure recovery in distributed graph processing systems

Leslie, Luke M.

Permalink

Description

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Log In