Techniques for optimizing dynamic parallelism on graphics processing units

El Hajj, Izzat

Permalink

https://hdl.handle.net/2142/102488

Description

Title

Techniques for optimizing dynamic parallelism on graphics processing units

Author(s)

El Hajj, Izzat

Issue Date

2018-12-06

Director of Research (if dissertation) or Advisor (if thesis)

Hwu, Wen-Mei W.

Doctoral Committee Chair(s)

Hwu, Wen-Mei W.

Committee Member(s)

Chen, Deming
Lumetta, Steven S.
Milojicic, Dejan S.

Department of Study

Electrical and Computer Engineering

Discipline

Electrical and Computer Engineering

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Keyword(s)

Graphics Processing Units
Dynamic Parallelism
Compilers
CUDA

Abstract

Dynamic parallelism is a feature of general-purpose graphics processing units (GPUs) whereby threads running on a GPU can spawn other threads without CPU intervention. This feature is useful for programming applications with nested parallelism where threads executing in parallel may each identify additional work that can itself be parallelized. Unfortunately, current GPU microarchitectures do not efficiently support using dynamic parallelism for accelerating applications with nested parallelism due to the high overhead of grid launches, the limited number of grids that can execute simultaneously, and the limited supported depth of the dynamic call stack. The compiler techniques presented herein improve the performance of applications with nested parallelism that use dynamic parallelism by mitigating the aforementioned microarchitectural limitations. Horizontal aggregation fuses grids launched by threads in the same warp, block, or grid into a single aggregated grid, thereby reducing the total number of grids launched and increasing the amount of work per grid to improve occupancy. Vertical aggregation fuses grids down the call stack with their descendant grids, again reducing the total number of grids launched but also reducing the depth of the call stack and removing grid launches from the application's critical path. Evaluation of these compiler techniques shows that they result in substantial performance improvement over regular dynamic parallelism for benchmarks representing common nested parallelism patterns. This observation has held true for multiple architecture generations, showing the continued relevance of these techniques. This work shows that to make dynamic parallelism practical for accelerating applications with nested parallelism, compiler transformations can be used to aggregate dynamically launched grids, thereby amortizing their launch overhead and improving their occupancy, without the need for additional hardware support.
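The index bookkeeping behind horizontal aggregation can be illustrated with a small CPU-side sketch. This is not the dissertation's compiler transformation, and it is written in Python rather than CUDA so it can be run anywhere. It assumes each parent thread records the size of the child grid it would have launched; the function names `aggregate_launches` and `locate` are hypothetical. One fused launch covers the sum of those sizes, and a prefix sum lets each thread of the fused launch recover which parent it belongs to.

```python
from bisect import bisect_right
from itertools import accumulate


def aggregate_launches(child_sizes):
    """Fuse per-parent child launches into one aggregated launch (illustrative).

    child_sizes[i] is the number of child threads parent thread i would have
    launched on its own (0 means that parent launches nothing).
    Returns (offsets, total): the exclusive prefix sums of the sizes and the
    thread count of the single aggregated launch.
    """
    inclusive = list(accumulate(child_sizes))
    total = inclusive[-1] if inclusive else 0
    offsets = [0] + inclusive[:-1]
    return offsets, total


def locate(offsets, agg_tid):
    """Map an aggregated thread index to (parent, index within that parent's child grid)."""
    # The last offset <= agg_tid identifies the owning parent; parents that
    # requested zero threads are skipped because they share an offset with
    # the next parent.
    parent = bisect_right(offsets, agg_tid) - 1
    return parent, agg_tid - offsets[parent]


# Four parent threads; the second one has no nested work.
sizes = [3, 0, 2, 4]
offsets, total = aggregate_launches(sizes)
work = [locate(offsets, t) for t in range(total)]
```

In this sketch, three separate child launches collapse into one launch of `total` threads, which is the effect the abstract describes: fewer grids launched and more work per grid. On a GPU the same lookup would typically be done by each thread of the aggregated grid, with the size array and prefix sums kept in device memory.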

Graduation Semester

2018-12

Type of Resource

text

Owning Collections

Graduate Dissertations and Theses at Illinois (primary)

Dissertations and Theses - Electrical and Computer Engineering
