Energy efficient computing exploiting data similarity and computation redundancy

Liu, Zhenhong

Energy efficient computing exploiting data similarity and computation redundancy

Liu, Zhenhong

Permalink

https://hdl.handle.net/2142/104973

Description

Title

Energy efficient computing exploiting data similarity and computation redundancy

Author(s)

Liu, Zhenhong

Issue Date

2019-04-11

Director of Research (if dissertation) or Advisor (if thesis)

Kim, Nam Sung

Doctoral Committee Chair(s)

Hwu, Wen-Mei

Committee Member(s)

Torrellas, Josep
Shanbhag, Naresh R.

Department of Study

Electrical & Computer Eng

Discipline

Electrical & Computer Engr

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Keyword(s)

Approximate computing
GPU architecture
Memoization
Computation redundancy

Abstract

Applications in various fields, such as machine learning, scientific computing and signal/image processing, need to deal with real-world input datasets. Such input datasets are usually discrete samples of slow-changing, continuous data of physical phenomena, like temperature maps and images. Due to the continuous nature of the physical phenomena, these datasets often contain data points with similar or even identical values. Eventually, it results in repeated operations performed on the same or similar data points, i.e. redundant computation. Redundant computation can be exploited to improve the energy/power efficiency and performance of processors, especially when the benefits from process technology scaling and power scaling keep diminishing. This dissertation first proposes a cost-effective generalized scalar execution architecture for GPUs, called G-Scalar. It exploits the redundant computation performed on identical data. G-Scalar offers two key advantages over prior architectures supporting scalar execution for only non-divergent arithmetic/logic instructions. First, G-Scalar is more power efficient as it can also support scalar execution of divergent and special-function instructions, the fraction of which in contemporary GPU applications has notably increased. Second, G-Scalar is less expensive as it can share most of its hardware resources with register value compression, of which adoption has been strongly promoted to reduce high power consumption of accessing the large register file. Compared with the baseline and previous scalar architectures, G-Scalar improves power efficiency by 24% and 15%, respectively, at a negligible cost. Lock and Load (LnL) architecture is then proposed to extend the coverage to redundant computation on similar data, by enabling approximate computing. In the LnL architecture, approximate computing is triggered by similarity of values returned by load instructions, and then approximation is applied to the annotated code region following the load instructions. Such a design reduces the overhead of checking eligibility of approximation for every instruction and allows us to deploy more sophisticated techniques for checking the eligibility of approximation and approximating the output values for all the skipped threads at the end. LnL is further enhanced to fuse the same approximated instructions from multiple warps into a single instruction, exploiting the fact that only a subset of threads in approximated warps are executed and many execution lanes are left unused by the skipped threads. This not only improves the performance but also reduces energy consumption, as it reduces the number of fetched, decoded, scheduled and executed instructions. Our experiment shows that LnL can improve energy efficiency and performance by 62% and 23%, respectively. Finally, we propose AxMemo to exploit the computation redundancy on CPUs. Inspired by LnL, AxMemo focuses on memoizing relatively large blocks of code with a variable number of inputs. In contrast, existing memoization techniques mostly replace costly floating-point operations that have a limited number of inputs with memory lookup. Since AxMemo aims to replace long sequences of instructions with a few lookup operations, it alleviates the von Neumann and execution overheads of passing instructions through the processor pipeline altogether. To address the challenge of handling various numbers of inputs, we develop a novel use of Cyclic Redundancy Checking (CRC) to hash the inputs and use the hash as lookup tags. It enables AxMemo to efficiently memoize relatively large code regions with variable input sizes and types using the same underlying hardware. Our experiment shows that AxMemo offers 2.82× speedup and 63% energy reduction, with mere 0.2% of quality loss averaged across ten benchmarks. These benefits come with an area overhead of just 2.08%.

Graduation Semester

2019-05

Type of Resource

text

Permalink

http://hdl.handle.net/2142/104973

Copyright and License Information

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Electrical and Computer Engineering

Dissertations and Theses in Electrical and Computer Engineering

Energy efficient computing exploiting data similarity and computation redundancy

Liu, Zhenhong

Permalink

Description

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Dissertations and Theses - Electrical and Computer Engineering

Log In