Unifying cross-modal concepts in vision and language
Whitehead, Spencer Robert
Permalink
https://hdl.handle.net/2142/110465
Description
- Title
- Unifying cross-modal concepts in vision and language
- Author(s)
- Whitehead, Spencer Robert
- Issue Date
- 2021-04-14
- Director of Research (if dissertation) or Advisor (if thesis)
- Ji, Heng
- Doctoral Committee Chair(s)
- Ji, Heng
- Committee Member(s)
- Schwing, Alexander
- Zhai, ChengXiang
- Chang, Shih-Fu
- Saenko, Kate
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Vision and Language
- Computer Vision
- Natural Language Processing
- NLP
- Multimedia
- Visual Question Answering
- VQA
- Video Captioning
- Contrastive Learning
- Phrase Grounding
- Knowledge-aware Text Generation
- Abstract
- Enabling computers to demonstrate a proficient understanding of the physical world is an exceedingly challenging task that requires the ability to perceive, through vision or other senses, and to communicate through natural language. Key to this endeavor is the representation of concepts present in the world within and across different modalities (e.g., vision and language). To an extent, models can capture concepts implicitly by training on large quantities of data. However, the complementary inter-modal and intra-modal connections between concepts are often not captured, which leads to issues such as difficulty generalizing a concept to new contexts or different appearances and an inability to integrate concepts from different sources. The focus of this dissertation is developing ways to represent concepts within models in a unified fashion across vision and language. In particular, we address three challenges:
- 1) Linking instances of concepts across modalities without strong supervision or large amounts of data external to the target task. In visual question answering, models tend to rely on contextual cues or learned priors instead of actually recognizing and linking concepts across modalities. Consequently, when a concept appears in a new context, models often fail to adapt. We learn to ground concept mentions in text to image regions in the context of visual question answering using self-supervision. We also demonstrate that learning concept grounding helps disentangle the skills required to answer questions from the concept mentions, which can improve generalization to novel compositions of skills and concepts.
- 2) Consistency towards different mentions of the same concept. An instance of a concept can take many different forms, such as the appearance of a concept in different images or the use of synonyms in text, and it can be difficult for models to infer these relationships from the training data alone. We show that existing visual question answering models have difficulty handling even straightforward changes in concept mentions and in the wording of questions. For related questions, we enforce consistency not only in the answers these models produce but also in their computed intermediate representations, which improves robustness to such variations.
- 3) Modeling associations between related concepts in complex domains. In scenarios where multiple related sources of information need to be considered, models must be able to connect concepts found within and across these different sources. We introduce the task of knowledge-aware video captioning for news videos, where models must generate descriptions of videos that leverage interconnected background knowledge pertaining to concepts involved in the videos. We build models that learn to associate patterns of concepts found in related news articles, such as entities and events, with video content in order to generate these knowledge-rich descriptions.
- Graduation Semester
- 2021-05
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright Spencer Whitehead 2021
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)
Graduate Theses and Dissertations at Illinois
Dissertations and Theses - Computer Science
Dissertations and Theses from the Dept. of Computer Science