Movement and placement of non-contiguous data in distributed GPU computing
Pearson, Carl
Permalink
https://hdl.handle.net/2142/110511
Description
- Title
- Movement and placement of non-contiguous data in distributed GPU computing
- Author(s)
- Pearson, Carl
- Issue Date
- 2021-04-20
- Director of Research (if dissertation) or Advisor (if thesis)
- Hwu, Wen-Mei
- Doctoral Committee Chair(s)
- Hwu, Wen-Mei
- Committee Member(s)
- Lumetta, Steven
- Olson, Luke
- Patel, Sanjay
- Xiong, Jinjun
- Department of Study
- Electrical and Computer Engineering
- Discipline
- Electrical and Computer Engineering
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- CUDA, MPI, stencil, GPU
- Abstract
- A steady increase in accelerator performance has driven demand for faster interconnects to avert the memory bandwidth wall. This has resulted in the wide adoption of heterogeneous systems with varying underlying interconnects, and has delegated the task of understanding and copying data to the system or application developer. Data transfer performance on these systems is now affected by many factors, including data transfer modality, system interconnect hardware details, CPU caching state, CPU power management state, driver policies, virtual memory paging efficiency, and data placement. This work finds that empirical communication measurements can be used to automatically schedule and execute intra- and inter-node communication in a modern heterogeneous system, providing "hand-tuned" performance without the need for complex or error-prone communication development at the application level. Empirical measurements are provided by a set of microbenchmarks designed for system and application developers to understand memory transfer behavior across different data placement and exchange scenarios. These benchmarks are the first comprehensive evaluation of all GPU communication primitives. For communication-heavy applications, optimally using communication capabilities is challenging and essential for performance. Two different approaches are examined. The first is a high-level 3D stencil communication library, which can automatically create a static communication plan based on the stencil and system parameters. This library is able to reduce the iteration time of a state-of-the-art stencil code by 1.45x at 3072 GPUs and 512 nodes. The second is a more general MPI interposer library, with novel non-contiguous data handling and runtime implementation selection for MPI communication primitives. A portable pure-MPI halo exchange is brought to within half the speed of the stencil-specific library, supported by a five-order-of-magnitude improvement in MPI communication latency for non-contiguous data.
- Graduation Semester
- 2021-05
- Type of Resource
- Thesis
- Permalink
- http://hdl.handle.net/2142/110511
- Copyright and License Information
- Copyright Carl Pearson 2021
Owning Collections
- Graduate Dissertations and Theses at Illinois (primary)
- Dissertations and Theses - Electrical and Computer Engineering