Video pretrained transformer with an ensemble of experts
Christl, Daniel
This item's files can only be accessed by the Administrator group.
Permalink
https://hdl.handle.net/2142/121245
Description
- Title
- Video pretrained transformer with an ensemble of experts
- Author(s)
- Christl, Daniel
- Issue Date
- 2023-07-19
- Director of Research (if dissertation) or Advisor (if thesis)
- Ji, Heng
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- transfer learning
- multimodal learning
- Abstract
- I present VideoSemble, a novel multimodal decoder-only model capable of comprehending and generating outputs across diverse modalities, including text, audio, images, and scene graphs. By incorporating state-of-the-art pretrained encoders for each modality, the model demonstrates a deep understanding of the underlying context and relationships present in the data. To train the model, the extensive YouTube-1B corpus is leveraged, consisting of 20 million YouTube videos that provide rich, multimodal context. The pretraining objective combines autoregressive output generation with contrastive learning, utilizing the non-output modalities as context. This approach encourages the model to form meaningful connections between modalities and develop a comprehensive understanding of the data. Following pretraining, the model is finetuned and evaluated on two benchmark datasets: the TV Question dataset, designed to assess multimodal question-answering capabilities, and the Kinetics-600 dataset, which measures action recognition and understanding in videos. The proposed model demonstrates inconsistent performance on both tasks, showcasing its potential to effectively synthesize information from multiple modalities and generate coherent, context-aware textual outputs, while also raising reservations about the pretraining and finetuning methodologies used. The findings presented in this thesis contribute to the growing body of research in multimodal understanding and generation, providing a robust and versatile framework for future exploration in the field. By combining state-of-the-art encoders with a decoder-only architecture, VideoSemble offers new insights into the potential for deep learning models to grasp the complex interplay between modalities and generate meaningful outputs across a diverse range of contexts.
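- The combined objective described in the abstract, autoregressive generation plus contrastive alignment of modality embeddings, can be sketched roughly as below. This is an illustrative NumPy sketch, not the thesis's implementation: the loss weighting, dimensions, temperature, and all function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def autoregressive_loss(logits, targets):
    """Mean negative log-likelihood of next-token targets (hypothetical shapes)."""
    logp = log_softmax(logits)                       # (T, V) token log-probs
    return -logp[np.arange(len(targets)), targets].mean()

def contrastive_loss(text_emb, other_emb, temperature=0.07):
    """InfoNCE-style loss: matching (text_i, other_i) pairs are positives."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    o = other_emb / np.linalg.norm(other_emb, axis=1, keepdims=True)
    sim = t @ o.T / temperature                      # (B, B) similarity matrix
    labels = np.arange(len(sim))                     # diagonal = positive pairs
    return -log_softmax(sim)[labels, labels].mean()

# Illustrative sizes: 8 tokens, 100-word vocab, batch of 4, 16-dim embeddings.
T, V, B, D = 8, 100, 4, 16
logits = rng.normal(size=(T, V))
targets = rng.integers(0, V, size=T)
text_emb = rng.normal(size=(B, D))                   # from a text encoder
audio_emb = rng.normal(size=(B, D))                  # from an audio encoder

# Hypothetical 0.5 weighting between the two terms.
total = autoregressive_loss(logits, targets) + 0.5 * contrastive_loss(text_emb, audio_emb)
print(float(total))
```

In practice each non-text modality would contribute its own contrastive term against the shared decoder representation; the sketch shows a single text-audio pair for brevity.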
- Graduation Semester
- 2023-08
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2023 Daniel Christl
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY