Techniques for enabling GPU code generation of low-level optimizations and dynamic parallelism from high-level abstractions

Garcia de Gonzalo, Simon P

Permalink

https://hdl.handle.net/2142/108570

Description

Title

Techniques for enabling GPU code generation of low-level optimizations and dynamic parallelism from high-level abstractions

Author(s)

Garcia de Gonzalo, Simon P

Issue Date

2020-07-16

Director of Research (if dissertation) or Advisor (if thesis)

Hwu, Wen-mei

Doctoral Committee Chair(s)

Hwu, Wen-mei

Committee Member(s)

Padua, David
Torrellas, Josep
Hammond, Simon

Department of Study

Computer Science

Discipline

Computer Science

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Keyword(s)

Code Generation
Code Transformation
Parallelism
Parallel Algorithms
Dynamic Parallelism
Heterogeneity
Performance Portability
GPU
DSL
Graph Analytics

Abstract

The relentless demand for improvements in compute throughput and energy efficiency has driven HPC systems and cloud service providers to rely heavily on GPUs. In turn, the availability of GPUs has led scientists and application programmers to invest resources in porting their codes to GPUs. Currently, there are multiple ways to target GPUs for computation. They range from low-level C-style syntax, which provides full control and the highest performance at the cost of slow code development, to high-level DSLs, which abstract away the complexities of GPU programming and speed up code development at the cost of performance. Between these two extremes, GPU libraries, pragma-based annotations, and high-level frameworks attempt to bridge the gap between performance and productivity. Regardless of which strategy is used to target GPUs, performance portability remains a challenge. Performance portability is tightly coupled to architectural differences across systems. Different GPU architectures deploy different implementations of certain instructions, such as atomic instructions, or incorporate new low-level primitives into an evolving ISA. Additionally, for many applications, achievable performance on any system is highly dependent on the input data being processed. Graph analytics is one such type of application: it is characterized by irregular computation in which achievable performance depends on the sparsity of the input graph. Current strategies for dealing with performance portability across both hardware differences and input characteristics either require inefficient and time-consuming code rewriting for libraries and low-level languages, or are not exposed at all in DSLs or high-level programming frameworks. The work presented herein designs a new set of high-level APIs and qualifiers, as well as specialized Abstract Syntax Tree (AST) transformations for high-level programming languages and DSLs.
The proposed transformations enable warp shuffle instructions, atomic instructions (on global and shared memory), and GPU dynamic parallelism to be easily generated. A practical implementation of these transformations is built on Tangram, a high-level kernel synthesis framework. The performance of the automatically generated low-level instructions is compared against another high-level framework and a hand-written high-performance library over three generations of GPU architectures. The generated code shows up to 7.8x speedup over hand-written code. The new Tangram API that exposes GPU dynamic parallelism is used to implement four graph analytics benchmarks. On six real-world graphs, the Tangram-generated dynamic code shows between 2x and 50x speedup over the hand-written benchmarks. The speedups across different graph applications and input graphs are discussed in detail. Lastly, a triangle counting case study is performed to assess the performance of the newly possible Tangram-generated code that leverages all techniques presented in this thesis. The generated code outperforms a cutting-edge implementation of triangle counting, a Graph Challenge finalist, by over 2x. On the whole, the work presented in this thesis demonstrates that code portability across different GPU hardware and across different inputs for different applications is possible from a high-level programming framework.

Graduation Semester

2020-08

Type of Resource

Thesis

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Computer Science

Dissertations and Theses from the Dept. of Computer Science
