Infrastructure to enable and exploit GPU orchestrated high-throughput storage access
Qureshi, Zaid
Permalink
https://hdl.handle.net/2142/116194
Description
- Title
- Infrastructure to enable and exploit GPU orchestrated high-throughput storage access
- Author(s)
- Qureshi, Zaid
- Issue Date
- 2022-07-08
- Director of Research (if dissertation) or Advisor (if thesis)
- Hwu, Wen-mei
- Doctoral Committee Chair(s)
- Hwu, Wen-mei
- Committee Member(s)
- Torrellas, Josep
- Padua, David
- Chung, I-Hsin
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- GPU
- memory
- storage
- HPC
- graph analytics
- data analytics
- memory wall
- SSD
- Flash
- software cache
- caching
- accelerators
- memory system
- storage system
- memory capacity
- memory capacity wall
- Abstract
- Graphics Processing Units (GPUs) have traditionally relied on the CPU to orchestrate access to data storage. This approach is well suited for GPU applications with known data access patterns, which allow the dataset to be partitioned and processed in a pipelined fashion on the GPU. However, many emerging applications, such as graph and data analytics, recommender systems, and graph neural networks, require fine-grained, data-dependent access to storage. CPU orchestration of storage access is unsuitable for these applications due to high CPU-GPU synchronization overheads, I/O traffic amplification, and long CPU processing latencies. GPU self-orchestrated storage access avoids these overheads by removing the CPU from the storage control path and is thus better suited for these applications. However, existing system architectures and software infrastructure lack support for such GPU-orchestrated storage access. In this work, we present a novel system architecture, BaM, that offers mechanisms for GPU code to efficiently access storage and enables GPU self-orchestrated storage access. BaM features a fine-grained, scalable software cache that coalesces data storage requests while minimizing I/O amplification effects. This software cache communicates with the storage system through high-throughput queues that enable the massive number of concurrent threads in modern GPUs to generate I/O requests at a rate high enough to fully utilize the available bandwidth of the interconnect and the storage system. Furthermore, we provide array-based abstractions that not only make integrating BaM into GPU kernels trivial for programmers but also transparently optimize the number of BaM cache accesses by exploiting common GPU thread access patterns. We evaluate the end-to-end performance and efficiency impact of each optimization in each layer of BaM’s software stack with a variety of workloads on multiple datasets.
Experimental results show that GPU self-orchestrated storage access running on BaM delivers 1.04× and 1.05× end-to-end speedups for BFS and CC graph analytics, respectively. Our experiments also show that GPU self-orchestrated storage access speeds up data-analytics workloads by 4.9× when running on the same hardware. In this work, we show that with carefully optimized systems software on the GPU, solid-state storage can extend the GPU’s effective memory capacity, providing performance (up to 4.62×), cost (up to 21.8×), and I/O efficiency (up to 3.72×) benefits, even over much more expensive state-of-the-art solutions using fast DRAM, for important GPU-accelerated applications.
- Graduation Semester
- 2022-08
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2022 Zaid Qureshi
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)
Graduate Theses and Dissertations at Illinois