Adaptive model inference under computational resource constraints
Yu, Haichao
Permalink
https://hdl.handle.net/2142/115689
Description
- Title
- Adaptive model inference under computational resource constraints
- Author(s)
- Yu, Haichao
- Issue Date
- 2022-04-11
- Director of Research (if dissertation) or Advisor (if thesis)
- Shi, Humphrey
- Doctoral Committee Chair(s)
- Shi, Humphrey
- Committee Member(s)
- Liang, Zhi-Pei
- Hasegawa-Johnson, Mark
- Hua, Gang
- Huang, Gao
- Department of Study
- Electrical & Computer Eng
- Discipline
- Electrical & Computer Engr
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Adaptive Inference
- Dynamic Neural Networks
- Efficient Inference
- Neural Network Quantization
- Gradient Boosting
- Abstract
- Deep learning has achieved great success in various computer vision tasks. When deep learning models are deployed on different devices, or on the same device at different times, the amount of available computational resources can vary. It is therefore an important research topic to design neural networks that can adapt to scenarios with different computational budgets and trade off accuracy against efficiency. To this end, one branch of existing methods is adaptive inference with sample-wise dynamic architectures. These methods can be grouped into three categories: dynamic routing, dynamic depth, and dynamic width. This dissertation focuses on dynamic post-training quantization (PQ) in the dynamic routing category and early exiting in the dynamic depth category. Existing PQ methods have two problems: (a) they require a training dataset for model calibration, which may be inaccessible, and (b) they are hard to generalize to low quantization bit-widths. For early-exiting methods, we find that prior works suffer from a train-test mismatch that creates a domain gap between training and inference. This dissertation explores how to solve these three problems to improve the generalizability and performance of dynamic models.

Given constrained computational resources, PQ uses a calibration dataset to convert a trained floating-point neural network into an efficient quantized one. The calibration dataset is usually sampled from the training data. However, training data may be unavailable at quantization time because of its large storage requirements or confidentiality concerns. In the first part of this dissertation, we propose a cross-domain calibration method that uses out-of-domain data for calibration when in-domain training data is unavailable, so that PQ can be applied more flexibly.

PQ works well at relatively high quantization bit-widths. To make a model work across a wider range of bit-widths, we propose the Any-Precision Deep Neural Network (APDNN), which is trained with a new method that makes the learned network flexible in numerical precision during inference. At runtime, the same model can be set to different bit-widths to support a dynamic trade-off between speed and accuracy. At each bit-width, the model achieves accuracy comparable to dedicated models trained at that precision. We demonstrate that this result is agnostic to model architecture and applicable to multiple vision tasks.

Going further in the direction of dynamic inference, we propose an early-exiting dynamic neural network (EDNN) that adjusts its inference complexity based not only on computational resources but also on input difficulty. When resources are limited or an input is easy to process, the model exits at a shallow layer; otherwise, it executes more layers to obtain a more confident prediction. Different from existing EDNN methods, we formulate the EDNN as an additive model inspired by gradient boosting and propose the Boosted Dynamic Neural Network (BoostNet), which mitigates the train-test mismatch problem present in existing EDNN methods.

In summary, the proposed cross-domain calibration method enables more flexible PQ for dynamic quantization without in-domain training data, APDNN generalizes dynamic quantization to extremely low bit-widths, and BoostNet adds flexibility by making the model architecture depend on the input while mitigating the train-test mismatch problem. In addition, BoostNet can be combined with the APDNN model to allow even more inference flexibility. Illustrative sketches of the any-precision and early-exiting inference ideas are given after the record details below.
- Graduation Semester
- 2022-05
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2022 Haichao Yu
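
The abstract above describes two mechanisms at a high level: setting a single trained network to different weight bit-widths at runtime (APDNN), and stopping inference early once a prediction is confident enough, with exit outputs combined additively in the spirit of gradient boosting (BoostNet). The minimal PyTorch sketch below illustrates the inference-time control flow of both ideas together. It is not code from the dissertation: the layer sizes, the symmetric per-tensor uniform quantizer, the confidence-threshold exit rule, and all names (quantize_weight, TinyEarlyExitNet) are assumptions chosen only to make the mechanisms concrete for a single input.

    # Illustrative sketch (assumptions, not the dissertation's implementation):
    # (1) symmetric per-tensor uniform quantization of weights at a
    #     runtime-selectable bit-width, and
    # (2) early-exit inference where each exit head's logits are summed into a
    #     running prediction (an additive, boosting-style ensemble) and the
    #     network stops once that prediction is confident enough.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def quantize_weight(w: torch.Tensor, bits: int) -> torch.Tensor:
        """Symmetric per-tensor uniform quantization to `bits`, dequantized back to float."""
        qmax = 2 ** (bits - 1) - 1                    # e.g. 127 for 8-bit
        scale = w.abs().max().clamp(min=1e-8) / qmax  # map the largest |w| to qmax
        return torch.round(w / scale).clamp(-qmax, qmax) * scale

    class TinyEarlyExitNet(nn.Module):
        """Three stages, each followed by an exit head (sizes are arbitrary)."""

        def __init__(self, in_dim=32, hidden=64, num_classes=10):
            super().__init__()
            self.stages = nn.ModuleList([
                nn.Linear(in_dim, hidden),
                nn.Linear(hidden, hidden),
                nn.Linear(hidden, hidden),
            ])
            self.exits = nn.ModuleList([nn.Linear(hidden, num_classes) for _ in self.stages])

        @torch.no_grad()
        def forward(self, x, bits=8, conf_threshold=0.9):
            logits_sum = 0.0
            for stage, exit_head in zip(self.stages, self.exits):
                # Quantize this stage's weights to the requested bit-width on the fly.
                w_q = quantize_weight(stage.weight, bits)
                x = F.relu(F.linear(x, w_q, stage.bias))
                # Accumulate exit logits additively, in the spirit of gradient boosting.
                logits_sum = logits_sum + exit_head(x)
                # Stop early if the running prediction is already confident.
                if F.softmax(logits_sum, dim=-1).max().item() >= conf_threshold:
                    break
            return logits_sum

    model = TinyEarlyExitNet()
    pred = model(torch.randn(1, 32), bits=4, conf_threshold=0.9)  # 4-bit weights, early exit allowed

In a real any-precision network the weights would be trained so that they stay accurate under several bit-widths, and a real early-exit network would train its exit heads jointly; the sketch only shows how a single model could trade accuracy for efficiency at inference time by lowering precision or exiting early.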
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)