Application support and adaptation for high-throughput accelerator orchestrated fine-grain storage access
Mailthody, Vikram Sharma
Permalink
https://hdl.handle.net/2142/116210
Description
- Title
- Application support and adaptation for high-throughput accelerator orchestrated fine-grain storage access
- Author(s)
- Mailthody, Vikram Sharma
- Issue Date
- 2022-07-12
- Director of Research
- Hwu, Wen-mei
- Doctoral Committee Chair(s)
- Hwu, Wen-mei
- Committee Member(s)
- Patel, Sanjay
- Chen, Deming
- Chung, I-Hsin
- Department of Study
- Electrical & Computer Engineering
- Discipline
- Electrical & Computer Engineering
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- GPUs
- Accelerators
- Memory Systems
- Storage Systems
- Systems for Machine Learning
- HPC
- Graph Analytics
- Data Analytics
- Software Caching
- Memory Capacity
- Memory Capacity Wall
- Solid-State Drives (SSDs)
- Operating Systems
- Heterogeneous Computing
- Abstract
- Accelerators like Graphics Processing Units (GPUs) have become popular compute devices for HPC, cloud, and machine learning applications because of their compute capabilities and high memory bandwidth. However, GPUs and other accelerators still live within the confines of their modest memory capacity and rely on an inefficient software stack running on the CPU to orchestrate access to data storage. This CPU-centric data orchestration is well suited to GPU applications whose parallel computation patterns are flat and regular, such as dense neural network training. Unfortunately, many emerging workloads, such as graph and data analytics, recommender systems, and graph neural networks, require fine-grain, data-dependent, sparse access to storage. CPU-centric orchestration of storage accesses is unsuitable for these applications due to high CPU-GPU synchronization overhead, I/O traffic amplification, and excessive CPU software bottlenecks.

  To overcome these limitations, this work analyzes and demonstrates the feasibility of using GPUs to orchestrate high-throughput, fine-grain direct access to storage for emerging workloads. We propose, implement, and evaluate the design of a cost-effective system architecture called BaM (Big Accelerator Memory). BaM capitalizes on recent improvements in the latency, throughput, cost, density, and endurance of solid-state storage devices and systems to realize another level of the accelerator memory hierarchy. BaM is an accelerator-centric approach in which GPU threads can identify and orchestrate on-demand access to data where it is stored, be it in memory or storage, without synchronizing with the CPU. This significantly decreases CPU-GPU synchronization overhead, avoids CPU software stack inefficiency, minimizes I/O amplification, and enables GPU programmers to treat storage as memory (a minimal sketch of this access model follows the description fields below).

  However, naively running applications on BaM does not yield performance and efficiency benefits. Because BaM essentially extends the accelerator memory hierarchy to storage, favorable access patterns are needed for BaM to reach its full potential. On the one hand, BaM requires coalesced accesses to extract high throughput from its cache; on the other hand, the BaM I/O stack requires many concurrent I/O requests to hide storage access latency. These conflicting requirements create a design dilemma, motivating the set of sophisticated optimization techniques and application adaptation strategies that allow applications to achieve peak performance on BaM. The proposed techniques, cache-line-aware parallel work assignment and on-demand implicit tiling, are generalizable across a wide range of data structures and emerging applications. Using these optimizations and application adaptations, we show that BaM is a viable, much less expensive alternative to existing DRAM-only and other state-of-the-art CPU-centric solutions.

  Overall, this dissertation proposes the design of a system capable of performing GPU-orchestrated storage access to extend the GPU's effective memory capacity, and provides a set of generalizable application adaptations that enable developers to maximize performance, cost efficiency, I/O efficiency, and capacity scalability for emerging workloads while simplifying software development, even without additional hardware support. With BaM, the user gets teraflops of GPU compute capability and terabytes of GPU-accessible memory at low cost.
- Graduation Semester
- 2022-08
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2022 Vikram Sharma Mailthody
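
A minimal CUDA sketch of the accelerator-centric access model described in the abstract is given below: GPU threads perform data-dependent, fine-grain reads through an array abstraction that stands in for storage-backed data. The storage_array type, its fields, and the gather kernel are hypothetical illustrations assumed for this sketch, not the dissertation's actual BaM API; a plain device buffer stands in for the software cache and the GPU-driven NVMe path so the example stays self-contained and runnable.

// Hedged sketch: GPU threads treating a large, storage-backed array as if
// it were memory. Not the dissertation's BaM API; storage_array is a
// hypothetical stand-in whose operator[] would, in a real BaM-style system,
// probe a GPU-resident software cache and issue an NVMe read on a miss.
#include <cstdio>
#include <cuda_runtime.h>

template <typename T>
struct storage_array {
    const T* backing;  // stand-in for cached, storage-resident data
    size_t   n;

    __device__ T operator[](size_t i) const {
        // Conceptual BaM-style path:
        //   1. map i to a storage block and probe the software cache
        //   2. on a miss, a leader thread enqueues an NVMe read from the GPU
        //   3. threads poll for completion, then read the cached line
        return backing[i];  // simplified: every access "hits"
    }
};

// Data-dependent gather: each thread chases an index into the storage-backed
// array. This is the fine-grain, sparse pattern that CPU-centric
// orchestration handles poorly.
__global__ void gather(storage_array<float> data,
                       const int* idx, float* out, int m) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < m) out[t] = data[idx[t]];
}

int main() {
    const int n = 1 << 20, m = 1024;
    float *backing, *out;
    int *idx;
    cudaMallocManaged(&backing, n * sizeof(float));
    cudaMallocManaged(&idx, m * sizeof(int));
    cudaMallocManaged(&out, m * sizeof(float));
    for (int i = 0; i < n; ++i) backing[i] = (float)i;
    for (int i = 0; i < m; ++i) idx[i] = (i * 7919) % n;  // scattered indices

    storage_array<float> arr{backing, (size_t)n};
    gather<<<(m + 255) / 256, 256>>>(arr, idx, out, m);
    cudaDeviceSynchronize();
    printf("out[0] = %.1f (expected %d)\n", out[0], idx[0]);
    cudaFree(backing); cudaFree(idx); cudaFree(out);
    return 0;
}

Under this model, the cache-line-aware parallel work assignment the abstract mentions would presumably map consecutive threads of a warp to consecutive elements within one cached storage block, so that a single miss, and hence a single I/O request, serves the whole warp; on-demand implicit tiling would similarly restructure the traversal so that outstanding misses remain numerous enough to hide storage latency while still coalescing within cache lines.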
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)