A semi-blocking checkpoint protocol to minimize checkpoint overhead
Ni, Xiang
Permalink
https://hdl.handle.net/2142/31066
Description
- Title
- A semi-blocking checkpoint protocol to minimize checkpoint overhead
- Author(s)
- Ni, Xiang
- Issue Date
- 2012-05-22T00:25:37Z
- Director of Research (if dissertation) or Advisor (if thesis)
- Kale, Laxmikant V.
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Fault Tolerance
- Checkpoint
- High Performance Computing (HPC)
- Abstract
- The increasing number of cores on current supercomputers will quickly decrease the system’s mean time to failure (MTTF). With such high failure rates, long-running applications will have little chance of completing successfully unless they use a fault tolerance strategy. Double in-memory/disk checkpointing is a production fault tolerance strategy in the Charm++ runtime system. Each node stores one copy of its checkpoint in its own memory or disk as a local checkpoint and another copy in another node’s memory or disk as a global checkpoint. This method takes advantage of the relatively high network bandwidth compared to I/O bandwidth, and it can store a checkpoint faster than traditional NFS-based checkpoint/restart. However, as the core count on each node keeps increasing, the large checkpoint size of a node will quickly saturate the limited network bandwidth. In this thesis, we introduce the semi-blocking checkpoint/restart protocol, which hides checkpoint overhead by overlapping the global checkpoint with application execution. To further analyze the benefits of the semi-blocking checkpoint protocol in the presence of failures, we extend Daly’s model and show the usefulness of the protocol for different kinds of applications. A solid state disk (SSD) is used in the semi-blocking checkpoint protocol when there is no space to store the checkpoint in memory. We present two strategies for choosing what data to store on the SSD based on the memory usage of applications. We also show the scalability and benefits of the semi-blocking checkpoint protocol: it improves performance by 20% compared to blocking checkpointing, and its overhead can be as low as 1.6% when checkpoint dumping time and the extra time to recover applications from failures are taken into account.
- Graduation Semester
- 2012-05
- Permalink
- http://hdl.handle.net/2142/31066
- Copyright and License Information
- Copyright 2012 Xiang Ni
Owning Collections
Dissertations and Theses - Computer Science
Dissertations and Theses from the Dept. of Computer Science
Graduate Dissertations and Theses at Illinois (primary)
Graduate Theses and Dissertations at Illinois