A good teacher is all you need
Farhat-Sabet, Sean
Permalink
https://hdl.handle.net/2142/121553
Description
- Title
- A good teacher is all you need
- Author(s)
- Farhat-Sabet, Sean
- Issue Date
- 2023-07-20
- Director of Research (if dissertation) or Advisor (if thesis)
- Chen, Deming
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Computer Science
- Artificial Intelligence
- Machine Learning
- Pre-Training
- Finetuning
- Knowledge Distillation
- Contrastive Learning
- Small Models
- Training Efficiency
- Generative Models
- Diffusion Models
- Synthetic Data
- Abstract
- In this thesis, we tackle the problem of improving the performance of small machine learning models. A classic approach to enhancing any model is to first pre-train it on an enormous, diverse dataset such as ImageNet, thereby equipping it with a strong feature backbone. Subsequently finetuning this model on a desired downstream task generally yields better performance than training it on the task alone. Larger models tend to benefit more from this pre-training paradigm, and several attempts have been made to minimize the drop in performance as model size decreases. We propose a simple, straightforward alternative that avoids the costly pre-training entirely by (1) taking a large, publicly available pre-trained model, (2) finetuning it on the desired task, and (3) teaching its knowledge of that task to a target small model (see the illustrative sketch after this record). Surprisingly, this leads to performance competitive with the pre-training paradigm, sometimes even surpassing it, while using only a fraction of the resources. Our approach can be viewed as designing a stronger knowledge distillation (KD) setup by explicitly considering the teacher’s knowledge and the student’s questions (the transfer dataset), as well as introducing two variants of a novel knowledge transfer algorithm. The first is derived from our perspective of KD as a form of Noise Contrastive Estimation (NCE), thereby allowing any tool from contrastive learning to be used; we choose one, the Alignment/Uniformity metric, as an illustration. The second uses ideas from work on metrics for high-dimensional representations, specifically GULP. Lastly, we gain a further boost in performance by augmenting the transfer dataset with synthetically generated samples from a publicly available, pre-trained, text-to-image generative diffusion model. We test our method on two small models across five visual recognition tasks, most of which are data-limited. Compared to their pre-trained-and-finetuned counterparts, our small models either surpass that performance or lag behind by at most 1.5%, while cutting training time by up to 95%. Thus, we refer to our paradigm as Don’t Pre-train, Teach (DPT).
- Graduation Semester
- 2023-08
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2023 Sean Farhat-Sabet
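Illustrative sketch: the following is a minimal PyTorch sketch of the three-step Don’t Pre-train, Teach (DPT) recipe described in the abstract, not the thesis’s actual implementation. The model choices (a ResNet-50 teacher, a MobileNetV2 student), the number of classes, and the temperature-scaled KL distillation loss used for the transfer step are all assumptions for illustration; the thesis’s NCE/Alignment-Uniformity and GULP-based transfer losses and its synthetic-data augmentation are not reproduced here.

```python
import torch
import torch.nn.functional as F
import torchvision

num_classes = 10  # assumed; the thesis evaluates several data-limited tasks

# (1) Take a large, publicly available pre-trained model as the teacher.
teacher = torchvision.models.resnet50(weights="IMAGENET1K_V2")
teacher.fc = torch.nn.Linear(teacher.fc.in_features, num_classes)

# (2) Finetune the teacher on the desired downstream task
#     (standard supervised finetuning loop elided for brevity).

# (3) Teach the finetuned teacher's knowledge of the task to a small student
#     over a transfer dataset (optionally augmented with synthetic samples).
student = torchvision.models.mobilenet_v2(num_classes=num_classes)

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Temperature-scaled KL distillation loss (an assumed stand-in for the
    thesis's NCE- and GULP-based transfer objectives)."""
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

def distill_step(images, optimizer):
    """One knowledge-transfer step on a batch from the transfer dataset."""
    teacher.eval()
    with torch.no_grad():
        t_logits = teacher(images)       # teacher's answers to the "questions"
    s_logits = student(images)
    loss = kd_loss(s_logits, t_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the student only ever sees the finetuned teacher's outputs on the transfer dataset, no pre-training of the small model is required, which is the source of the training-time savings described in the abstract.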
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois