Techniques for enabling GPU code generation of low-level optimizations and dynamic parallelism from high-level abstractions
Garcia de Gonzalo, Simon P
Permalink
https://hdl.handle.net/2142/108570
Description
- Title
- Techniques for enabling GPU code generation of low-level optimizations and dynamic parallelism from high-level abstractions
- Author(s)
- Garcia de Gonzalo, Simon P
- Issue Date
- 2020-07-16
- Director of Research (if dissertation) or Advisor (if thesis)
- Hwu, Wen-mei
- Doctoral Committee Chair(s)
- Hwu, Wen-mei
- Committee Member(s)
- Padua, David
- Torrellas, Josep
- Hammond, Simon
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Code Generation
- Code Transformation
- Parallelism
- Parallel Algorithms
- Dynamic Parallelism
- Heterogeneity
- Performance Portability
- GPU
- DSL
- Graph Analytics
- Abstract
- The relentless demand for improvements in compute throughput and energy efficiency has driven HPC systems and cloud service providers to rely heavily on GPUs. In turn, the availability of GPUs has led scientists and application programmers to invest resources in porting their codes to GPUs. Currently, there are multiple ways to target GPUs for computation. These range from low-level C-style syntax, which provides full control and the highest performance at the cost of slow code development, to high-level DSLs that abstract away the complexities of GPU programming, speeding up code development at the cost of performance. Between these two extremes, GPU libraries, pragma-based annotations, and high-level frameworks attempt to bridge the gap between performance and productivity. Regardless of which strategy is used to target GPUs, performance portability remains a challenge. Performance portability is tightly coupled to architectural differences across systems. Different GPU architectures deploy different implementations of certain instructions, such as atomic instructions, or incorporate new low-level primitives into an evolving ISA. Additionally, for many applications, achievable performance on any system is highly dependent on the input data being processed. Graph analytics is one such class of applications, characterized by irregular computation in which achievable performance depends on the sparsity of the input graph. Current strategies for dealing with performance portability across both hardware differences and input characteristics either require inefficient and time-consuming code rewriting, in the case of libraries and low-level languages, or are not exposed at all, in the case of DSLs and high-level programming frameworks. The work presented herein designs a new set of high-level APIs and qualifiers, as well as specialized Abstract Syntax Tree (AST) transformations for high-level programming languages and DSLs.
The proposed transformations enable warp shuffle instructions, atomic instructions (on global and shared memories), and GPU dynamic parallelism to be easily generated. A practical implementation of these transformations is built on Tangram, a high-level kernel synthesis framework. The performance of the automatically generated low-level instructions is compared against another high-level framework and a hand-written high-performance library over three generations of GPU architectures. The generated code shows up to 7.8x speedup over hand-written code. The new Tangram API that exposes GPU dynamic parallelism is used to implement four graph analytics benchmarks. On six real-world graphs, the Tangram-generated dynamic code shows between 2x and 50x speedup over the hand-written benchmarks. The speedups across different graph applications and input graphs are discussed in detail. Lastly, a triangle counting case study is performed to ascertain the performance of the Tangram-generated code that leverages all techniques presented in this thesis. The generated code outperforms a cutting-edge implementation of triangle counting, a Graph Challenge finalist, by over 2x. On the whole, the work presented in this thesis demonstrates that code portability across different GPU hardware and across different inputs for different applications is possible from a high-level programming framework.
- Graduation Semester
- 2020-08
- Type of Resource
- Thesis
- Permalink
- http://hdl.handle.net/2142/108570
- Copyright and License Information
- Copyright 2020 Simon Garcia de Gonzalo
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois
Dissertations and Theses - Computer Science
Dissertations and Theses from the Dept. of Computer Science