Withdraw
Loading…
Approximate failure recovery in distributed graph processing systems
Leslie, Luke M.
Loading…
Permalink
https://hdl.handle.net/2142/90630
Description
- Title
- Approximate failure recovery in distributed graph processing systems
- Author(s)
- Leslie, Luke M.
- Issue Date
- 2016-04-25
- Director of Research (if dissertation) or Advisor (if thesis)
- Gupta, Indranil
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Failure recovery
- distributed graph processing
- approximate
- Abstract
- Distributed graph processing systems are an emerging area of big data systems. As graphs continue to grow in size and prevalence, these systems must become faster and more scalable. However, after failures, distributed graph processing systems either largely rely on proactive fault tolerance techniques such as checkpointing, or use no fault tolerance mechanisms at all and simply restart computation. The former approach entails significant proactive overheads that increase with the size of the graph, while the latter wastes time and resources in potentially lengthy recomputation. In this thesis, we argue that distributed graph processing systems should instead use a approximate approach to failure recovery that trades off minimal amounts of application accuracy while reducing the overhead during failure-free execution to zero, and allowing fast and scalable recovery. We build a system called Zorro that imbues the approximate reactive approach, and integrate Zorro into two distributed graph processing systems -- PowerGraph and LFGraph. When a failure occurs, Zorro opportunistically exploits vertex replication (inherent in today's graph processing systems) to quickly and scalably rebuild the state of failed servers. In addition, we describe three other novel failure recovery mechanisms that aim to address several of Zorro's shortcomings. The first utilizes optimistic accuracy results from graph sampling and hence continues after failure without taking any action. The second repartitions the graph after failure to avoid waiting for replacement servers, and then continues computation with the recovered state. The last allows a small amount of proactive overhead to significantly increase the fraction of recovered state. Experiments using five real-world graphs and eight benchmark applications demonstrate that Zorro is able to recover over 99% of the graph state when a few servers fail, and between 87-92% when half the cluster fails, with recovery taking only a fraction of the cost of a single iteration. Furthermore, using eight common graph processing algorithms, Zorro incurs little to no accuracy loss in all experimental failure scenarios. Furthermore, preliminary analysis and experiments using our three alternative approaches suggest that they are able to address many of the potential issues Zorro faces with minimal overhead and accuracy loss.
- Graduation Semester
- 2016-05
- Type of Resource
- text
- Permalink
- http://hdl.handle.net/2142/90630
- Copyright and License Information
- Copyright 2016 Luke M. Leslie
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at IllinoisManage Files
Loading…
Edit Collection Membership
Loading…
Edit Metadata
Loading…
Edit Properties
Loading…
Embargoes
Loading…