Learning joint latent representations for images and language
Wang, Liwei
Description
- Title
- Learning joint latent representations for images and language
- Author(s)
- Wang, Liwei
- Issue Date
- 2018-07-10
- Director of Research (if dissertation) or Advisor (if thesis)
- Lazebnik, Svetlana
- Doctoral Committee Chair(s)
- Lazebnik, Svetlana
- Committee Member(s)
- Forsyth, David
- Hockenmaier, Julia
- Schwing, Alexander
- Tu, Zhuowen
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- deep learning
- computer vision
- Abstract
- Computer vision is moving from predicting discrete, categorical labels to generating rich descriptions of visual data, in particular in the form of natural language. Learning joint latent representations for images and language is vital to solving many image-text tasks, such as image-sentence retrieval, visual grounding, and image captioning. In this thesis, we first propose two-branch neural networks for learning the similarity between these two modalities. Two network structures are proposed to produce different output representations. The first, referred to as an embedding network, learns an explicit shared latent embedding space with a maximum-margin ranking loss and novel neighborhood constraints. The second, referred to as a similarity network, fuses the two branches via an element-wise product and is trained with a regression loss to directly predict a similarity score. Extensive experiments show that our networks achieve high accuracies for phrase localization on the Flickr30K Entities dataset and for bidirectional image-sentence retrieval on the Flickr30K and COCO datasets. We then explore the image captioning problem using conditional variational auto-encoders (CVAEs). Standard CVAEs with a fixed Gaussian prior yield descriptions with too little variability. Instead, we propose two models that explicitly structure the latent space with K components corresponding to different types of image content, and combine components to create priors for images that contain multiple types of content simultaneously (e.g., several kinds of objects). The first model uses a Gaussian mixture model (GMM) prior, while the second defines a novel additive Gaussian (AG) prior that linearly combines component means. Experiments show that both models produce captions that are more diverse and more accurate than a strong LSTM baseline or a “vanilla” CVAE with a fixed Gaussian prior, with AG-CVAE showing particular promise. To further improve the caption decoder inherited from the AG-CVAE model, we attempt to train it by optimizing caption evaluation metrics (e.g., BLEU scores) using policy gradients from reinforcement learning. The loss function contains two terms: a maximum likelihood estimation (MLE) loss and a reinforcement term based on a sum of non-differentiable rewards. Experiments show that training the decoder with this combined loss helps generate more accurate captions. We also study the problem of ranking generated sentences conditioned on the image input, and explore several variants of deep rankers built on top of the two-branch networks proposed earlier. (Illustrative sketches of the ranking loss, the AG prior, and the combined training loss follow this record.)
- Graduation Semester
- 2018-08
- Type of Resource
- text
- Permalink
- http://hdl.handle.net/2142/101544
- Copyright and License Information
- Copyright 2018 Liwei Wang
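Illustrative sketches

The abstract's first contribution trains two branches with a maximum-margin ranking loss over both retrieval directions. The following is a minimal PyTorch sketch of that kind of bidirectional loss; the in-batch negative sampling and the margin value are assumptions for illustration, and the thesis's neighborhood constraints are omitted.

import torch
import torch.nn.functional as F

def bidirectional_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Max-margin ranking loss over in-batch negatives for a two-branch
    embedding network. img_emb and txt_emb are (B, D) L2-normalized
    embeddings; row i of each matrix is a matching image-sentence pair.
    The in-batch negatives and margin=0.2 are illustrative assumptions."""
    scores = img_emb @ txt_emb.t()                       # (B, B) cosine similarities
    pos = scores.diag().unsqueeze(1)                     # similarities of true pairs
    # hinge over every negative, in both retrieval directions
    cost_i2t = (margin + scores - pos).clamp(min=0)      # image -> text
    cost_t2i = (margin + scores.t() - pos).clamp(min=0)  # text -> image
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_i2t = cost_i2t.masked_fill(mask, 0.0)           # ignore the positives
    cost_t2i = cost_t2i.masked_fill(mask, 0.0)
    return (cost_i2t.sum() + cost_t2i.sum()) / scores.size(0)

# toy usage: a random batch of 8 matching pairs of 64-d embeddings
img = F.normalize(torch.randn(8, 64), dim=1)
txt = F.normalize(torch.randn(8, 64), dim=1)
loss = bidirectional_ranking_loss(img, txt)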
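The AG-CVAE is described as a prior that linearly combines K component means according to the types of content present in an image. The sketch below shows only that combination step under assumed shapes (K content types, D latent dimensions, per-image weights summing to 1) and an assumed isotropic unit variance; it is not the thesis's full CVAE.

import torch

def ag_prior_mean(component_means, weights):
    """AG-style prior mean: a weighted linear combination of K component
    means, one per content type. component_means: (K, D); weights: (B, K)
    with rows summing to 1. Shapes and weighting scheme are assumptions."""
    return weights @ component_means                # (B, D) prior means

# hypothetical numbers: K=5 content types, D=64 latent dims, batch of 2
means = torch.randn(5, 64)
w = torch.softmax(torch.randn(2, 5), dim=1)         # per-image content weights
mu = ag_prior_mean(means, w)
z = mu + torch.randn_like(mu)                       # sample z ~ N(mu, I); unit variance assumed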
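The decoder training combines an MLE term with a reinforcement term driven by non-differentiable rewards such as BLEU. A minimal sketch of one such two-term loss follows; the REINFORCE formulation, the tensor shapes, and the mixing weight alpha are assumptions, not the thesis's exact objective.

import torch
import torch.nn.functional as F

def combined_caption_loss(logits, targets, sample_logprobs, rewards, alpha=0.5):
    """Two-term decoder loss: a maximum-likelihood (cross-entropy) term on
    ground-truth captions plus a REINFORCE-style term in which a
    non-differentiable reward (e.g. the BLEU score of a sampled caption)
    weights the negative log-probability of the sampled tokens.
    logits: (B, T, V); targets: (B, T); sample_logprobs, rewards: (B,).
    alpha=0.5 is an assumed mixing weight."""
    mle = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    reinforce = -(rewards.detach() * sample_logprobs).mean()
    return alpha * mle + (1.0 - alpha) * reinforce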
Owning Collections
Graduate Dissertations and Theses at Illinois (PRIMARY)
Dissertations and Theses - Computer Science