Cross-layer methods for energy-efficient inference using in-memory architectures
Gonugondla, Sujan Kumar
Description
- Title
- Cross-layer methods for energy-efficient inference using in-memory architectures
- Author(s)
- Gonugondla, Sujan Kumar
- Issue Date
- 2020-04-28
- Director of Research (if dissertation) or Advisor (if thesis)
- Shanbhag, Naresh R
- Doctoral Committee Chair(s)
- Shanbhag, Naresh R
- Committee Member(s)
- Hanumolu, Pavan Kumar
- Schwing, Alexander
- Gopalakrishnan, Kailash
- Department of Study
- Electrical & Computer Engineering
- Discipline
- Electrical & Computer Engineering
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- deep neural networks, edge, inference, machine learning, on-chip learning, in-memory architectures, in-memory computing, application specific integrated circuits, SRAM, quantization, compression, accelerator, energy-efficiency, cross-layer
- Abstract
- In the near future, we will be surrounded by intelligent devices that transform the way we interact with the world. These devices need to acquire and process data to derive actions and interpretations, automating and monitoring many tasks without human intervention. Such tasks require the implementation of complex machine learning (ML) algorithms on these devices. Deep neural networks (DNNs) have evolved into the state-of-the-art approach for ML tasks. However, realizing computationally intensive algorithms such as DNNs under stringent constraints on energy, latency, and form factor is a formidable challenge. In conventional von Neumann architectures, the energy and latency cost of realizing ML algorithms is dominated by memory accesses. To address this issue, the deep in-memory architecture (DIMA) was proposed, which embeds mixed-signal computation as an integral part of the memory read cycle. Deep in-memory architectures have shown up to 100x gains in energy-delay product (EDP) over conventional digital von Neumann architectures. However, the use of mixed-signal computation makes in-memory architectures susceptible to process variations and other circuit non-idealities. Therefore, in-memory architectures exhibit a fundamental trade-off between system-level energy, latency, and accuracy when implementing ML tasks. Our research focuses on developing cross-layer methods to optimize this system-level energy-latency-accuracy trade-off of in-memory architectures for ML applications.
First, an automated quantization framework is presented to minimize the precision requirements of DNNs. This framework allocates precision at kernel-level granularity via an iterative greedy process and demonstrates 1.2x-to-1.3x lower precision requirements than state-of-the-art methods on compact networks such as MobileNet-V1.
Next, a compositional framework is proposed that relates the energy consumption and signal-to-noise ratio (SNR) of in-memory architectures to their circuit, architectural, and algorithmic parameters. Analysis using this framework allows in-memory architectures to be designed to meet application-level precision requirements.
The energy efficiency of DIMA can be further enhanced by compensation techniques that enable low-SNR operation without any loss in system-level accuracy. The use of stochastic gradient descent (SGD)-based on-chip learning to compensate for the impact of chip-specific process variations is studied. The benefits of on-chip learning are demonstrated on a 65 nm prototype integrated circuit (IC) that shows a 2.4x reduction in energy over DIMA operating with off-chip trained weights, and up to 100x improvement in EDP over conventional digital architectures.
In-memory architectures using beyond-CMOS technologies such as STT-MRAM and ReRAM crossbars have become popular due to their density and scalability advantages. However, such resistive crossbars suffer from inaccurate writes due to device variability and cycle-to-cycle variations. We present the Single-Write In-memory Program-vErify (SWIPE) method, which achieves high-accuracy writes for crossbar-based in-memory architectures at 5x-to-10x lower cost than standard program-verify methods. SWIPE leverages the bit-sliced attribute of crossbar-based in-memory architectures and the statistics of conductance variations to compensate for device non-idealities.
Extending in-memory computing to storage-class technologies such as NAND flash is challenging due to stringent density constraints, large capacitances, and low-mobility transistors. A DIMA for NAND flash memories is introduced, achieving an 8x-to-23x reduction in energy and a 9x-to-15x improvement in throughput over conventional NAND flash systems. Together, these results demonstrate that cross-layer methods are effective in enhancing the system-level energy, latency, and accuracy of ML systems realized via in-memory architectures. Illustrative code sketches of these ideas follow the record below.
- Graduation Semester
- 2020-05
- Type of Resource
- Thesis
- Permalink
- http://hdl.handle.net/2142/107929
- Copyright and License Information
- Copyright 2020 Sujan Gonugondla
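Illustrative sketches
The abstract's first contribution allocates DNN precision at kernel-level granularity via an iterative greedy process. The sketch below shows one plausible form of such a loop; it is a minimal illustration, not the dissertation's algorithm, and `eval_fn`, `kernels`, and the stopping tolerance are hypothetical stand-ins.
```python
def greedy_precision_allocation(eval_fn, kernels, max_bits=8, min_bits=2,
                                tolerance=0.5):
    """Greedily lower per-kernel bit-widths while accuracy stays in budget.

    eval_fn(bits) -> accuracy of the network quantized per the dict `bits`
    (a hypothetical evaluator, not part of the dissertation's code).
    """
    bits = {k: max_bits for k in kernels}      # start at full precision
    baseline = eval_fn(bits)                   # reference accuracy
    while True:
        best_kernel, best_acc = None, float("-inf")
        for k in kernels:                      # try one fewer bit per kernel
            if bits[k] <= min_bits:
                continue
            trial = dict(bits, **{k: bits[k] - 1})
            acc = eval_fn(trial)
            if acc > best_acc:
                best_kernel, best_acc = k, acc
        # accept the cheapest single-kernel reduction; stop once even the
        # best reduction pushes accuracy outside the tolerance
        if best_kernel is None or baseline - best_acc > tolerance:
            return bits
        bits[best_kernel] -= 1
```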
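The compositional framework relates energy consumption and SNR to circuit and architectural parameters; the exact relations are developed in the dissertation. As a generic, heavily simplified illustration of why such a trade-off exists, the toy model below assumes a charge-domain bitline read whose signal swing costs energy while the variation-induced noise floor stays fixed; all constants are assumptions.
```python
import math

C_BL = 100e-15       # assumed bitline capacitance [F]
V_DD = 1.0           # assumed supply voltage [V]
SIGMA_NOISE = 10e-3  # assumed variation noise referred to the bitline [V]

def read_energy(delta_v):
    """Energy to develop a bitline swing of delta_v volts (Q = C * dV)."""
    return C_BL * V_DD * delta_v

def snr_db(delta_v):
    """SNR of the analog read: swing over a fixed noise floor, in dB."""
    return 20 * math.log10(delta_v / SIGMA_NOISE)

# Larger swing -> more energy per read but higher SNR (hence accuracy).
for dv in (0.05, 0.1, 0.2, 0.4):
    print(f"swing={dv:.2f} V  energy={read_energy(dv) * 1e15:5.1f} fJ  "
          f"SNR={snr_db(dv):4.1f} dB")
```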
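The on-chip learning result uses SGD to adapt weights to the specific chip's analog non-idealities rather than deploying off-chip trained weights directly. The simulation below is a schematic of the idea under an assumed model where the chip applies a fixed, unknown per-weight gain error; the update uses only the chip's own output, so the learned weights absorb the variation.
```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
gain = 1.0 + 0.05 * rng.standard_normal(n)   # frozen chip-specific variation

def chip_dot(w, x):
    """Dot product as computed by this (simulated) mixed-signal chip."""
    return np.dot(gain * w, x)

w_ideal = rng.standard_normal(n)             # off-chip trained weights
w = w_ideal.copy()                           # copy adapted on chip
lr = 0.01
for _ in range(2000):                        # SGD against the chip's output
    x = rng.standard_normal(n)
    err = chip_dot(w, x) - np.dot(w_ideal, x)
    w -= lr * err * x                        # gain itself is unobserved

def mse(v):
    return float(np.mean(v ** 2))

print("output error, off-chip weights:", mse(gain * w_ideal - w_ideal))
print("output error, adapted weights :", mse(gain * w - w_ideal))
```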
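SWIPE avoids iterative program-verify by exploiting the bit-sliced organization of crossbars: the error observed after writing a high-significance slice can be absorbed into the target of the lower-significance slice. The toy below illustrates only that compensation idea; it is not the published algorithm, and it ignores conductance range limits.
```python
import numpy as np

rng = np.random.default_rng(1)
SIGMA = 0.6  # assumed write noise per cell, in LSB-slice units

def write_cell(target):
    """Single-shot analog write with conductance variation."""
    return target + SIGMA * rng.standard_normal()

def naive_write(value):
    # write both 4-bit slices independently; MSB-slice noise is weighted 16x
    return 16 * write_cell(value // 16) + write_cell(value % 16)

def swipe_like_write(value):
    msb = write_cell(value // 16)            # write, then sense, the MSB slice
    residual = value - 16 * msb              # error as seen at read-out
    return 16 * msb + write_cell(residual)   # fold error into the LSB slice

values = rng.integers(0, 256, size=10000).astype(float)
naive = np.array([naive_write(v) for v in values])
comp = np.array([swipe_like_write(v) for v in values])
print("naive RMS error      :", np.sqrt(np.mean((naive - values) ** 2)))
print("compensated RMS error:", np.sqrt(np.mean((comp - values) ** 2)))
```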
Owning Collections
Graduate Dissertations and Theses at Illinois (PRIMARY)
Dissertations and Theses - Electrical and Computer Engineering