Training a massively multimodal transformer on YouTube data: pre-training and parameter efficient fine-tuning on HPC infrastructure
Day, Kastan Vrabel
Permalink
https://hdl.handle.net/2142/120172
Description
- Title
- Training a massively multimodal transformer on YouTube data: pre-training and parameter efficient fine-tuning on HPC infrastructure
- Author(s)
- Day, Kastan Vrabel
- Issue Date
- 2023-05-04
- Director of Research (if dissertation) or Advisor (if thesis)
- Kindratenko, Volodymyr
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- AI
- ML
- LLMs
- Multimodal Transformers
- PEFT
- RLHF
- RLAIF
- Abstract
- In machine learning, the widespread adoption of pre-trained large language models (LLMs) across many domains has disrupted conventional wisdom on the right model to use for the job. This work investigates the advantages of pre-training large language models from scratch over fine-tuning existing ones in specific scenarios, such as exploring the importance of custom tokenizers for domain-specific applications, addressing information leakage concerns, and examining the use of LLMs in non-traditional applications such as time-series forecasting. When pre-training is unnecessary, this work argues that select parameter efficient fine-tuning (PEFT) methods are strictly superior to traditional fine-tuning in data and computational efficiency and should be preferred in nearly all cases. Furthermore, after PEFT, it is ideal to further sculpt the outputs of one's LLM with Reinforcement Learning from Human Feedback (RLHF). This work argues that reinforcement learning (RL), rather than any form of supervised fine-tuning (SFT), is preferable for achieving truthfulness without hallucination. Practitioners should seek to leverage the benefits of RLHF via RL from AI feedback (RLAIF), an effective, fast, and economical alternative to human feedback that retains the benefits of reward modeling for factuality. Additionally, the paper discusses the three most successful learning objectives in multimodal transformers and the challenges they face in aligning distinct embedding spaces. I present my own model, Video Pre-trained Transformer, a multimodal mixture of pre-trained experts for video question answering tasks, benchmarked against VQAv2. The importance of modern ML-first databases and filesystems is explored in the context of multimodal, multi-model AI systems for fast and flexible data throughput on HPC and multi-cloud systems.
Together, the paper captures the state of open-source LLMs and the opportunities for researchers and practitioners to combine the attractive properties of PEFT, RLAIF, and multimodal transformers, paving the way for the next few years of growth in AI capabilities.
- Graduation Semester
- 2023-05
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2023 Kastan Day
Owning Collections
Graduate Dissertations and Theses at Illinois (Primary)