Infrastructure to enable and exploit GPU orchestrated high-throughput storage access
Qureshi, Zaid
Permalink
https://hdl.handle.net/2142/116194
Description
- Title
- Infrastructure to enable and exploit GPU orchestrated high-throughput storage access
- Author(s)
- Qureshi, Zaid
- Issue Date
- 2022-07-08
- Director of Research (if dissertation) or Advisor (if thesis)
- Hwu, Wen-mei
- Doctoral Committee Chair(s)
- Hwu, Wen-mei
- Committee Member(s)
- Torrellas, Josep
- Padua, David
- Chung, I-Hsin
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- GPU
- memory
- storage
- HPC
- graph analytics
- data analytics
- memory wall
- SSD
- Flash
- software cache
- caching
- accelerators
- memory system
- storage system
- memory capacity
- memory capacity wall
- Abstract
- Graphics Processing Units (GPUs) have traditionally relied on the CPU to orchestrate access to data storage. This approach is well suited for GPU applications with known data access patterns, which allow the dataset to be partitioned and processed in a pipelined fashion on the GPU. However, many emerging applications, such as graph and data analytics, recommender systems, and graph neural networks, require fine-grained, data-dependent access to storage. CPU orchestration of storage access is unsuitable for these applications due to high CPU-GPU synchronization overheads, I/O traffic amplification, and long CPU processing latencies. GPU self-orchestrated storage access avoids these overheads by removing the CPU from the storage control path and is thus better suited for these applications. However, existing system architectures and software infrastructure lack support for such GPU-orchestrated storage access. In this work, we present a novel system architecture, BaM, that offers mechanisms for GPU code to efficiently access storage and enables GPU self-orchestrated storage access. BaM features a fine-grained, scalable software cache that coalesces data storage requests while minimizing I/O amplification effects. This software cache communicates with the storage system through high-throughput queues that enable the massive number of concurrent threads in modern GPUs to generate I/O requests at a rate high enough to fully utilize the available bandwidth of the interconnect and the storage system. Furthermore, we provide array-based abstractions that not only make integrating BaM into GPU kernels trivial for programmers but also transparently optimize the number of BaM cache accesses by exploiting common GPU thread access patterns. We evaluate the end-to-end performance and efficiency impact of each optimization in each layer of BaM’s software stack with a variety of workloads on multiple datasets.
Experimental results show that GPU self-orchestrated storage access running on BaM delivers 1.04× and 1.05× end-to-end speedups for BFS and CC graph analytics, respectively. Our experiments also show that GPU self-orchestrated storage access speeds up data-analytics workloads by 4.9× when running on the same hardware. In this work, we show that with carefully optimized systems software on the GPU, solid-state storage can extend the GPU’s effective memory capacity, providing performance (up to 4.62×), cost (up to 21.8×), and I/O efficiency (up to 3.72×) benefits, even over much more expensive state-of-the-art solutions using fast DRAM, for important GPU-accelerated applications.
- Graduation Semester
- 2022-08
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2022 Zaid Qureshi
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)
Graduate Theses and Dissertations at Illinois