Dynamic multimodal learning: Empowering AI to interpret the temporally dynamic world through vision, language, audio, and video
Khosla, Savya
Permalink
https://hdl.handle.net/2142/124514
Description
Title
Dynamic multimodal learning: Empowering AI to interpret the temporally dynamic world through vision, language, audio, and video
Author(s)
Khosla, Savya
Issue Date
2024-04-12
Advisor
Hoiem, Derek W
Department of Study
Computer Science
Discipline
Computer Science
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
M.S.
Degree Level
Thesis
Keyword(s)
Multimodal Learning
Video Representation Learning
Large Multimodal Models
Abstract
As humans, we perceive and comprehend our surroundings through various sensory inputs. Multimodal learning aims to empower machines to do the same: to learn by leveraging multiple modalities, such as vision, language, and audio, and thereby develop a more holistic understanding of the world. This learning approach not only enhances the capabilities of Artificial Intelligence (AI) systems but also enables them to better navigate and interact with real-world scenarios. To further augment AI systems' understanding of the real world, it is essential to equip them with the ability to comprehend its dynamic nature. Thus, with the goal of enhancing AI systems' interpretation of multiple modalities and understanding of the temporally dynamic world, this work focuses on two key areas of investigation:
The first area of investigation focuses on building general-purpose systems that are capable of performing tasks requiring several different modalities. To this end, we propose the first autoregressive multimodal model that is capable of parsing images, text, audio, and video as input, and generating images, text, and audio as output. In particular, we discuss techniques to map the multiple modalities into a shared semantic space, process them with a single encoder-decoder transformer model, stabilize model training, and evaluate its performance on a broad array of over 120 multimodal tasks.
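As a rough illustration of the core idea (not the thesis implementation), the sketch below projects tokens from several modalities into one shared embedding space and processes them with a single encoder-decoder transformer. All module names, dimensions, and the toy vocabulary sizes are illustrative assumptions.

```python
# Hypothetical sketch: modality-specific projections into a shared space,
# followed by one shared encoder-decoder transformer.
import torch
import torch.nn as nn

class SharedSpaceMultimodalModel(nn.Module):
    def __init__(self, d_model=256, text_vocab=1000,
                 image_patch_dim=768, audio_dim=128):
        super().__init__()
        # Modality-specific projections into the shared d_model-dimensional space.
        self.text_embed = nn.Embedding(text_vocab, d_model)
        self.image_proj = nn.Linear(image_patch_dim, d_model)  # e.g., ViT patch features
        self.audio_proj = nn.Linear(audio_dim, d_model)        # e.g., spectrogram frames
        # One encoder-decoder transformer shared across all modalities.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True)
        self.text_head = nn.Linear(d_model, text_vocab)  # decode back to text tokens

    def forward(self, text_ids, image_patches, audio_frames, target_ids):
        # Map every modality to the shared semantic space, then concatenate
        # into a single input sequence for the encoder.
        tokens = torch.cat([
            self.text_embed(text_ids),
            self.image_proj(image_patches),
            self.audio_proj(audio_frames),
        ], dim=1)
        tgt = self.text_embed(target_ids)
        out = self.transformer(tokens, tgt)
        return self.text_head(out)  # logits over the text vocabulary

# Toy usage with random inputs.
model = SharedSpaceMultimodalModel()
logits = model(
    text_ids=torch.randint(0, 1000, (2, 16)),
    image_patches=torch.randn(2, 196, 768),
    audio_frames=torch.randn(2, 32, 128),
    target_ids=torch.randint(0, 1000, (2, 8)),
)
print(logits.shape)  # (2, 8, 1000)
```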
The second area of investigation explores multimodal learning within the dynamic context of the world, as represented in videos. Our focus lies on training a memory-augmented video encoder by jointly supervising various modalities present in video data. We showcase the proposed encoder's proficiency in modeling long-form videos while capturing both nuanced and overarching details of the video content. Additionally, we demonstrate the generalizability of the learned representations by adapting them to a challenging downstream task without any task-specific bells and whistles.
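As a simplified, hypothetical sketch in the spirit of the second contribution, the encoder below processes a long video chunk by chunk and carries a small learned memory bank across chunks, so both fine-grained and video-level details can be summarized without attending over all frames at once. Names and dimensions are assumptions, not the thesis implementation.

```python
# Hypothetical sketch of a memory-augmented video encoder for long-form video.
import torch
import torch.nn as nn

class MemoryAugmentedVideoEncoder(nn.Module):
    def __init__(self, frame_dim=512, d_model=256, memory_slots=16, n_heads=8):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, d_model)
        # Learned initial memory; carried across chunks of the video.
        self.init_memory = nn.Parameter(torch.randn(memory_slots, d_model))
        # Memory slots read from the current chunk via cross-attention.
        self.read = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.update = nn.GRUCell(d_model, d_model)  # per-slot recurrent memory update

    def forward(self, frames, chunk_size=8):
        # frames: (batch, num_frames, frame_dim) precomputed per-frame features
        b = frames.size(0)
        memory = self.init_memory.unsqueeze(0).expand(b, -1, -1).contiguous()
        for start in range(0, frames.size(1), chunk_size):
            chunk = self.frame_proj(frames[:, start:start + chunk_size])
            # Each memory slot attends to the frames of the current chunk...
            read, _ = self.read(memory, chunk, chunk)
            # ...and is then updated recurrently with what it read.
            flat_mem = memory.reshape(-1, memory.size(-1))
            flat_read = read.reshape(-1, read.size(-1))
            memory = self.update(flat_read, flat_mem).view_as(memory)
        return memory  # (batch, memory_slots, d_model) summary of the whole video

# Toy usage: a 64-frame video with precomputed per-frame features.
encoder = MemoryAugmentedVideoEncoder()
video_summary = encoder(torch.randn(2, 64, 512))
print(video_summary.shape)  # (2, 16, 256)
```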