Video pretrained transformer with an ensemble of experts
Christl, Daniel
This item's files can only be accessed by the Administrator group.
Permalink
https://hdl.handle.net/2142/121245
Description
- Title
- Video pretrained transformer with an ensemble of experts
- Author(s)
- Christl, Daniel
- Issue Date
- 2023-07-19
- Director of Research (if dissertation) or Advisor (if thesis)
- Ji, Heng
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- transfer learning
- multimodal learning
- Abstract
- I present VideoSemble, a novel multimodal decoder-only model capable of comprehending and generating outputs across diverse modalities, including text, audio, images, and scene graphs. By incorporating state-of-the-art pretrained encoders for each modality, the model demonstrates a deep understanding of the underlying context and relationships present in the data. To train the model, the extensive YouTube-1B corpus is leveraged, consisting of 20 million YouTube videos that provide rich, multimodal context. The pretraining objective combines autoregressive output generation with contrastive learning, utilizing the non-output modalities as context. This approach encourages the model to form meaningful connections between modalities and develop a comprehensive understanding of the data. Following pretraining, the model is finetuned and evaluated on two benchmark datasets: the TV Question dataset, designed to assess multimodal question-answering capabilities, and the Kinetics-600 dataset, which measures action recognition and understanding in videos. The proposed model demonstrates inconsistent performance on both tasks, showcasing its potential to effectively synthesize information from multiple modalities and generate coherent, context-aware textual outputs, while also raising reservations about the pretraining and finetuning methodologies used. The findings presented in this thesis contribute to the growing body of research in multimodal understanding and generation, providing a robust and versatile framework for future exploration in the field. By combining state-of-the-art encoders with a decoder-only architecture, VideoSemble offers new insights into the potential for deep learning models to grasp the complex interplay between modalities and generate meaningful outputs across a diverse range of contexts.
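- The combined objective described in the abstract, autoregressive generation plus contrastive alignment of modality embeddings, can be sketched roughly as below. This is an illustrative NumPy sketch, not the thesis's implementation: the loss weighting, dimensions, temperature, and all function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def autoregressive_loss(logits, targets):
    """Mean negative log-likelihood of next-token targets (hypothetical shapes)."""
    logp = log_softmax(logits)                       # (T, V) token log-probs
    return -logp[np.arange(len(targets)), targets].mean()

def contrastive_loss(text_emb, other_emb, temperature=0.07):
    """InfoNCE-style loss: matching (text_i, other_i) pairs are positives."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    o = other_emb / np.linalg.norm(other_emb, axis=1, keepdims=True)
    sim = t @ o.T / temperature                      # (B, B) similarity matrix
    labels = np.arange(len(sim))                     # diagonal = positive pairs
    return -log_softmax(sim)[labels, labels].mean()

# Illustrative sizes: 8 tokens, 100-word vocab, batch of 4, 16-dim embeddings.
T, V, B, D = 8, 100, 4, 16
logits = rng.normal(size=(T, V))
targets = rng.integers(0, V, size=T)
text_emb = rng.normal(size=(B, D))                   # from a text encoder
audio_emb = rng.normal(size=(B, D))                  # from an audio encoder

# Hypothetical 0.5 weighting between the two terms.
total = autoregressive_loss(logits, targets) + 0.5 * contrastive_loss(text_emb, audio_emb)
print(float(total))
```

In practice each non-text modality would contribute its own contrastive term against the shared decoder representation; the sketch shows a single text-audio pair for brevity.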
- Graduation Semester
- 2023-08
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2023 Daniel Christl
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY