Application support and adaptation for high-throughput accelerator orchestrated fine-grain storage access
Mailthody, Vikram Sharma
Permalink
https://hdl.handle.net/2142/116210
Description
- Title
- Application support and adaptation for high-throughput accelerator orchestrated fine-grain storage access
- Author(s)
- Mailthody, Vikram Sharma
- Issue Date
- 2022-07-12
- Director of Research
- Hwu, Wen-mei
- Doctoral Committee Chair(s)
- Hwu, Wen-mei
- Committee Member(s)
- Patel, Sanjay
- Chen, Deming
- Chung, I-Hsin
- Department of Study
- Electrical & Computer Engineering
- Discipline
- Electrical & Computer Engineering
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- GPUs
- Accelerators
- Memory Systems
- Storage Systems
- Systems for Machine Learning
- HPC
- Graph Analytics
- Data Analytics
- Software Caching
- Memory Capacity
- Memory Capacity Wall
- Solid-State Drives (SSDs)
- Operating Systems
- Heterogeneous Computing
- Abstract
- Accelerators like Graphics Processing Units (GPUs) have become popular compute devices for HPC, cloud, and machine learning applications because of their compute capabilities and high memory bandwidth. However, GPUs and other accelerators still live within the confines of their modest memory capacity and rely on an inefficient software stack running on the CPU to orchestrate access to data storage. This CPU-centric data orchestration is well suited to GPU applications whose parallel computation patterns are flat and regular, such as dense neural network training. Unfortunately, many emerging workloads, such as graph and data analytics, recommender systems, and graph neural networks, require fine-grain, data-dependent, sparse access to storage. CPU-centric orchestration of storage accesses is unsuitable for these applications due to high CPU-GPU synchronization overhead, I/O traffic amplification, and excessive CPU software bottlenecks.

  To overcome these limitations, this work analyzes and demonstrates the feasibility of using GPUs to orchestrate high-throughput, fine-grain direct access to storage for emerging workloads. We propose, implement, and evaluate the design of a cost-effective system architecture called BaM (Big Accelerator Memory). BaM capitalizes on recent improvements in the latency, throughput, cost, density, and endurance of solid-state storage devices and systems to realize another level of the accelerator memory hierarchy. BaM is an accelerator-centric approach in which GPU threads can identify and orchestrate on-demand access to data where it is stored, be it in memory or storage, without synchronizing with the CPU. This significantly decreases CPU-GPU synchronization overhead, avoids CPU software stack inefficiency, minimizes I/O amplification, and enables GPU programmers to treat storage as memory (a minimal sketch of this access model follows the description fields below).

  However, naively running applications on BaM does not yield performance and efficiency benefits. Because BaM essentially extends the accelerator memory hierarchy to storage, favorable access patterns are needed for BaM to reach its full potential. On the one hand, BaM requires coalesced accesses to extract high throughput from its cache; on the other hand, the BaM I/O stack requires many concurrent I/O requests to hide storage access latency. These conflicting requirements create a design dilemma, motivating the set of sophisticated optimization techniques and application adaptation strategies that allow applications to achieve peak performance on BaM. The proposed techniques, cache-line-aware parallel work assignment and on-demand implicit tiling, are generalizable across a wide range of data structures and emerging applications. Using these optimizations and application adaptations, we show that BaM is a viable, much less expensive alternative to existing DRAM-only and other state-of-the-art CPU-centric solutions.

  Overall, this dissertation proposes the design of a system capable of performing GPU-orchestrated storage access to extend the GPU's effective memory capacity, and provides a set of generalizable application adaptations that enable developers to maximize performance, cost efficiency, I/O efficiency, and capacity scalability for emerging workloads while simplifying software development, even without additional hardware support. With BaM, the user gets teraflops of GPU compute capability and terabytes of GPU-accessible memory at low cost.
- Graduation Semester
- 2022-08
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2022 Vikram Sharma Mailthody
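
A minimal CUDA sketch of the accelerator-centric access model described in the abstract is given below: GPU threads perform data-dependent, fine-grain reads through an array abstraction that stands in for storage-backed data. The storage_array type, its fields, and the gather kernel are hypothetical illustrations assumed for this sketch, not the dissertation's actual BaM API; a plain device buffer stands in for the software cache and the GPU-driven NVMe path so the example stays self-contained and runnable.

// Hedged sketch: GPU threads treating a large, storage-backed array as if
// it were memory. Not the dissertation's BaM API; storage_array is a
// hypothetical stand-in whose operator[] would, in a real BaM-style system,
// probe a GPU-resident software cache and issue an NVMe read on a miss.
#include <cstdio>
#include <cuda_runtime.h>

template <typename T>
struct storage_array {
    const T* backing;  // stand-in for cached, storage-resident data
    size_t   n;

    __device__ T operator[](size_t i) const {
        // Conceptual BaM-style path:
        //   1. map i to a storage block and probe the software cache
        //   2. on a miss, a leader thread enqueues an NVMe read from the GPU
        //   3. threads poll for completion, then read the cached line
        return backing[i];  // simplified: every access "hits"
    }
};

// Data-dependent gather: each thread chases an index into the storage-backed
// array. This is the fine-grain, sparse pattern that CPU-centric
// orchestration handles poorly.
__global__ void gather(storage_array<float> data,
                       const int* idx, float* out, int m) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < m) out[t] = data[idx[t]];
}

int main() {
    const int n = 1 << 20, m = 1024;
    float *backing, *out;
    int *idx;
    cudaMallocManaged(&backing, n * sizeof(float));
    cudaMallocManaged(&idx, m * sizeof(int));
    cudaMallocManaged(&out, m * sizeof(float));
    for (int i = 0; i < n; ++i) backing[i] = (float)i;
    for (int i = 0; i < m; ++i) idx[i] = (i * 7919) % n;  // scattered indices

    storage_array<float> arr{backing, (size_t)n};
    gather<<<(m + 255) / 256, 256>>>(arr, idx, out, m);
    cudaDeviceSynchronize();
    printf("out[0] = %.1f (expected %d)\n", out[0], idx[0]);
    cudaFree(backing); cudaFree(idx); cudaFree(out);
    return 0;
}

Under this model, the cache-line-aware parallel work assignment the abstract mentions would presumably map consecutive threads of a warp to consecutive elements within one cached storage block, so that a single miss, and hence a single I/O request, serves the whole warp; on-demand implicit tiling would similarly restructure the traversal so that outstanding misses remain numerous enough to hide storage latency while still coalescing within cache lines.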
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)