Learning video representations with limited supervision
McKee, Daniel Benjamin
Permalink
https://hdl.handle.net/2142/121988
Description
- Title
- Learning video representations with limited supervision
- Author(s)
- McKee, Daniel Benjamin
- Issue Date
- 2023-11-22
- Director of Research
- Lazebnik, Svetlana
- Doctoral Committee Chair(s)
- Lazebnik, Svetlana
- Committee Member(s)
- Forsyth, David
- Hoiem, Derek
- Tighe, Joseph
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- computer vision
- deep learning
- video
- self-supervised learning
- weakly supervised learning
- language models
- multi-modal models
- object tracking
- Abstract
- With the rapid growth of deep computer vision models, demand for large quantities of annotated data has risen higher than ever. Obtaining visual annotations, especially dense annotations requiring fine-grained localization of objects, is a costly and labor-intensive process. Dense video tasks like tracking or video object segmentation pose an even greater annotation challenge because of the steep cost of labeling many individual frames. As a result, datasets for these tasks often lack the scale and diversity of annotated image datasets. To combat these limitations, we investigate how to take advantage of unlabeled videos, image annotations, and transfer of large-scale pretrained models to achieve effective performance on dense video tasks. First, we study representations for dense label propagation tasks in video, focusing on self-supervised approaches to learning temporal correspondence and comparing how image-trained models can be adapted for these tasks. Second, we investigate how to train a multi-object tracking model in the absence of tracking annotations. In place of fully supervised annotations, we demonstrate how to learn from unlabeled videos and from videos that are hallucinated from annotated images using data augmentation techniques. Lastly, we explore a multi-modal problem setting in which we wish to automatically recommend an audio soundtrack for an input video and a text description of the desired music. In this setting, we explore adapting large-scale models like CLIP for joint modeling of video, text, and audio. We also investigate using recent large language models to generate pseudo-label text descriptions for training.
- Graduation Semester
- 2023-12
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2023 Daniel McKee
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)