Representations from vision and language
Gupta, Tanmay
Permalink
https://hdl.handle.net/2142/107978
Description
- Title
- Representations from vision and language
- Author(s)
- Gupta, Tanmay
- Issue Date
- 2020-05-05
- Director of Research (if dissertation) or Advisor (if thesis)
- Hoiem, Derek
- Doctoral Committee Chair(s)
- Hoiem, Derek
- Committee Member(s)
- Lazebnik, Svetlana
- Schwing, Alexander
- Gupta, Abhinav
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Vision
- Language
- Word Embeddings
- Representation Learning
- Contrastive Learning
- Phrase Grounding
- Semantic Scene Generation
- Human-Object Interaction Detection
- Deep Learning
- Transfer Learning
- Multitask Learning
- Abstract
- Replicating a human-level understanding of the physical world in computers is a monumental task. Achieving this requires building representations of concepts that manifest themselves visually, linguistically, or through other senses. Furthermore, concepts do not exist in isolation but are related to each other. In this work, we show how to build representations of concepts from visual and textual data, link visual manifestations of concepts to references in text descriptions (a problem known as word or phrase grounding) without strong supervision, and model the interactions between concepts. Specifically, we address three challenges faced by existing vision-language models. The first challenge is building generalizable and accurate representations of images and words. For generalization across tasks, we build aligned image-word representations that can be shared across multiple tasks, such as visual recognition and visual question answering, and that enhance inductive transfer between them. We also augment text-only word embeddings with word embeddings learned from visual co-occurrences to provide more accurate representations of visual concepts. The second challenge is linking references to visual concepts in textual descriptions to the corresponding regions in the image without requiring strong supervision in the form of word-region grounding annotations. We show that maximizing a lower bound on the mutual information between image regions and captions leads to state-of-the-art phrase grounding performance (a schematic sketch of such a contrastive objective follows this record). The third challenge is extending vision-language systems to model interactions between visual entities. We build systems that demonstrate this ability in both generation and detection settings. We show how to generate a plausible layout and appearance of entities given a text description of entity actions and interactions, and we develop a state-of-the-art factored model and training techniques for detecting human-object interactions using pretrained object and pose detectors.
- Graduation Semester
- 2020-05
- Type of Resource
- Thesis
- Permalink
- http://hdl.handle.net/2142/107978
- Copyright and License Information
- Copyright 2020 Tanmay Gupta
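
The abstract's second challenge hinges on maximizing a lower bound on mutual information between image regions and captions. The sketch below shows one common way such a bound is realized in practice: an InfoNCE-style contrastive objective in which the matched image-caption pair is the positive and other captions in the batch serve as negatives. All names, tensor shapes, and the word-to-region attention pooling are illustrative assumptions for this sketch, not the thesis's exact architecture or objective.

```python
# Minimal, hypothetical sketch of a contrastive (InfoNCE-style) lower bound on
# mutual information between image regions and caption words, in the spirit of
# weakly supervised phrase grounding. Shapes and pooling are assumptions.
import torch
import torch.nn.functional as F


def region_word_scores(regions, words):
    """Compatibility of every caption with every image via word-region attention.

    regions: (B, R, D) region features for B images, R regions each
    words:   (B, W, D) word features for the B paired captions, W words each
    Returns: (B, B) matrix; entry (i, j) scores caption j against image i.
    """
    # Cosine-normalize features so dot products stay bounded.
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)
    # Pairwise word-region similarities for every (image, caption) pair: (B, B, W, R)
    sim = torch.einsum('ird,jwd->ijwr', regions, words)
    # Each word softly attends to its most compatible region (soft max over regions),
    # then per-word scores are averaged into a caption-level score.
    word_scores = sim.softmax(dim=-1).mul(sim).sum(dim=-1)  # (B, B, W)
    return word_scores.mean(dim=-1)                          # (B, B)


def infonce_grounding_loss(regions, words, temperature=0.07):
    """Negative InfoNCE bound: matched image-caption pairs are positives,
    other captions in the batch act as negatives."""
    logits = region_word_scores(regions, words) / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: image-to-caption and caption-to-image retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    B, R, W, D = 4, 36, 12, 256
    loss = infonce_grounding_loss(torch.randn(B, R, D), torch.randn(B, W, D))
    print(float(loss))
```

Because the objective only needs matched image-caption pairs, no word-region annotations are required; grounding emerges from the word-to-region attention weights learned to maximize the bound.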
Owning Collections
- Graduate Dissertations and Theses at Illinois (Primary): Graduate Theses and Dissertations at Illinois
- Dissertations and Theses - Computer Science: Dissertations and Theses from the Dept. of Computer Science