A good teacher is all you need
Farhat-Sabet, Sean
Permalink
https://hdl.handle.net/2142/121553
Description
- Title
- A good teacher is all you need
- Author(s)
- Farhat-Sabet, Sean
- Issue Date
- 2023-07-20
- Director of Research (if dissertation) or Advisor (if thesis)
- Chen, Deming
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Computer Science
- Artificial Intelligence
- Machine Learning
- Pre-Training
- Finetuning
- Knowledge Distillation
- Contrastive Learning
- Small Models
- Training Efficiency
- Generative Models
- Diffusion Models
- Synthetic Data
- Abstract
- In this thesis, we tackle the problem of improving the performance of small machine learning models. A classic approach to enhancing any model is to first pre-train it on an enormous, diverse dataset such as ImageNet, thereby equipping it with a strong feature backbone. Subsequently finetuning this model on a desired downstream task generally yields better performance than training it on the task alone. Larger models tend to benefit more from this pre-training paradigm, and several attempts have been made to minimize the drop in performance as model size decreases. We propose a simple, straightforward alternative that avoids the costly pre-training entirely by (1) taking a large, publicly available pre-trained model, (2) finetuning it on the desired task, and (3) teaching its knowledge of that task to a target small model (see the illustrative sketch after this record). Surprisingly, this leads to performance competitive with the pre-training paradigm, sometimes even surpassing it, while using only a fraction of the resources. Our approach can be viewed as designing a stronger knowledge distillation (KD) setup by explicitly considering the teacher’s knowledge and the student’s questions (the transfer dataset), as well as introducing two variants of a novel knowledge transfer algorithm. The first is derived from our perspective of KD as a form of Noise Contrastive Estimation (NCE), thereby allowing any tool from contrastive learning to be used; we choose one, the Alignment/Uniformity metric, as an illustration. The second uses ideas from work on metrics for high-dimensional representations, specifically GULP. Lastly, we gain a further boost in performance by augmenting the transfer dataset with synthetically generated samples from a publicly available, pre-trained, text-to-image generative diffusion model. We test our method on two small models across five visual recognition tasks, most of which are data-limited. Compared to their pre-trained-and-finetuned counterparts, our small models either surpass that performance or lag behind by at most 1.5%, while cutting training time by up to 95%. Thus, we refer to our paradigm as Don’t Pre-train, Teach (DPT).
- Graduation Semester
- 2023-08
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2023 Sean Farhat-Sabet
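Illustrative sketch: the following is a minimal PyTorch sketch of the three-step Don’t Pre-train, Teach (DPT) recipe described in the abstract, not the thesis’s actual implementation. The model choices (a ResNet-50 teacher, a MobileNetV2 student), the number of classes, and the temperature-scaled KL distillation loss used for the transfer step are all assumptions for illustration; the thesis’s NCE/Alignment-Uniformity and GULP-based transfer losses and its synthetic-data augmentation are not reproduced here.

```python
import torch
import torch.nn.functional as F
import torchvision

num_classes = 10  # assumed; the thesis evaluates several data-limited tasks

# (1) Take a large, publicly available pre-trained model as the teacher.
teacher = torchvision.models.resnet50(weights="IMAGENET1K_V2")
teacher.fc = torch.nn.Linear(teacher.fc.in_features, num_classes)

# (2) Finetune the teacher on the desired downstream task
#     (standard supervised finetuning loop elided for brevity).

# (3) Teach the finetuned teacher's knowledge of the task to a small student
#     over a transfer dataset (optionally augmented with synthetic samples).
student = torchvision.models.mobilenet_v2(num_classes=num_classes)

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Temperature-scaled KL distillation loss (an assumed stand-in for the
    thesis's NCE- and GULP-based transfer objectives)."""
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

def distill_step(images, optimizer):
    """One knowledge-transfer step on a batch from the transfer dataset."""
    teacher.eval()
    with torch.no_grad():
        t_logits = teacher(images)       # teacher's answers to the "questions"
    s_logits = student(images)
    loss = kd_loss(s_logits, t_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the student only ever sees the finetuned teacher's outputs on the transfer dataset, no pre-training of the small model is required, which is the source of the training-time savings described in the abstract.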
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois