Reasoning, scaling, generating with vision-language models
Wang, Zhonghao
Permalink
https://hdl.handle.net/2142/124262
Description
- Title
- Reasoning, scaling, generating with vision-language models
- Author(s)
- Wang, Zhonghao
- Issue Date
- 2024-04-16
- Director of Research (if dissertation) or Advisor (if thesis)
- Hasegawa-Johnson, Mark
- Shi, Humphrey
- Doctoral Committee Chair(s)
- Hasegawa-Johnson, Mark
- Shi, Humphrey
- Committee Member(s)
- Varshney, Lav
- Wei, Wei
- Department of Study
- Electrical & Computer Eng
- Discipline
- Electrical & Computer Engr
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Vision-Language Models
- VLM
- Diffusion Model
- Visual Reasoning
- Video Temporal Localization
- Text-To-Image Generation
- Abstract
- The intersection of vision and language models has paved the way for groundbreaking advancements in artificial intelligence, enabling systems to comprehend and generate multimodal content with unprecedented sophistication. This dissertation presents a comprehensive study on enhancing the capabilities of vision-language models (VLMs), with a focus on three critical dimensions: reasoning, scaling, and generating. Through an innovative combination of deep learning techniques, this research advances the understanding and application of VLMs in performing complex reasoning tasks, scaling to accommodate diverse and large-scale datasets, and generating coherent and contextually relevant multimodal outputs.
Firstly, the dissertation introduces a novel framework for augmenting VLMs with enhanced reasoning capabilities, allowing them to infer and deduce information from visual and textual cues in a manner akin to human cognitive processes. Specifically, we study the problem of concept induction in visual reasoning, i.e., identifying concepts and their hierarchical relationships from question-answer pairs associated with images, and we achieve an interpretable model by working in the induced symbolic concept space. To this end, we first design a new framework named the object-centric compositional attention model (OCCAM) to perform the visual reasoning task with object-level visual features. We then develop a method to induce concepts of objects and relations using clues from the attention patterns between objects' visual features and question words. Finally, we achieve a higher level of interpretability by applying OCCAM to objects represented in the induced symbolic concept space. Experiments on the CLEVR and GQA datasets demonstrate that: 1) OCCAM achieves a new state of the art without human-annotated functional programs; and 2) the induced concepts are both accurate and sufficient, as OCCAM achieves on-par performance whether objects are represented by visual features or in the induced symbolic concept space.
Secondly, the dissertation addresses the challenge of scaling VLMs, both in terms of model architecture and data handling. We propose a multi-task model architecture that improves performance on multiple downstream video tasks, including temporal action localization, moment retrieval, and action segmentation. While large-scale image-text pretrained models such as CLIP have been used for multiple video-level tasks on trimmed videos, their use for temporal localization in untrimmed videos remains relatively unexplored. We design a new approach for this task, called UnLoc, which uses pretrained image and text towers and feeds tokens to a video-text fusion model. The outputs of the fusion module are then used to construct a feature pyramid in which each level connects to a head that predicts a per-frame relevancy score and start/end time displacements. Unlike previous works, our architecture enables moment retrieval, temporal localization, and action segmentation with a single-stage model, without the need for action proposals, motion-based pretrained features, or representation masking. Unlike specialized models, we achieve state-of-the-art results on all three localization tasks with a unified approach.
Lastly, the dissertation delves into the generation capabilities of VLMs, presenting methodologies for creating accurate and diverse visual content in accordance with textual descriptions. We explore advancements in high-fidelity personalized image generation through the use of pre-trained text-to-image diffusion models. While previous approaches have made significant strides in generating versatile scenes based on text descriptions and a few input images, challenges persist in maintaining subject fidelity within the generated images. In this work, we introduce an algorithm named HiFi Tuner to enhance the appearance preservation of objects during personalized image generation. Our proposed method employs a parameter-efficient fine-tuning framework comprising a denoising process and a pivotal inversion process. Key enhancements include the use of mask guidance, a novel parameter regularization technique, and the incorporation of step-wise subject representations to elevate sample fidelity. Additionally, we propose a reference-guided generation approach that leverages the pivotal inversion of a reference image to mitigate unwanted subject variations and artifacts. We further extend our method to a novel image editing task: substituting the subject in an image through textual manipulation. Experimental evaluations conducted on the DreamBooth dataset using the Stable Diffusion model show promising results. Fine-tuning solely on textual embeddings improves the CLIP-T score by 3.6 points and the DINO score by 9.6 points over Textual Inversion; when fine-tuning all parameters, HiFi Tuner improves the CLIP-T score by 1.2 points and the DINO score by 1.2 points over DreamBooth, establishing a new state of the art. This dissertation represents a significant step forward in the quest to build more intelligent vision-language models, offering insights and tools that will fuel future innovations in the field. (Minimal, illustrative code sketches of the three approaches appear at the end of this record.)
- Graduation Semester
- 2024-05
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2024 Zhonghao Wang
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)
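
The sketches below are illustrative only; they are not the dissertation's code, and all module, function, and variable names are assumptions introduced for exposition.

The abstract describes OCCAM inducing concepts from the attention patterns between objects' visual features and question words. A minimal sketch of that object-word attention, assuming simple scaled dot-product attention and random stand-in features:

```python
# Illustrative sketch (not the dissertation's code): question words attend
# over object-level visual features; the resulting attention pattern is the
# "clue" linking words to objects that could seed concept induction.
import torch
import torch.nn.functional as F

num_objects, num_words, dim = 5, 7, 64
object_feats = torch.randn(num_objects, dim)   # per-object visual features (stand-in)
word_embeds = torch.randn(num_words, dim)      # question word embeddings (stand-in)

# Scaled dot-product attention of words over objects.
scores = word_embeds @ object_feats.T / dim ** 0.5   # (num_words, num_objects)
attn = F.softmax(scores, dim=-1)

# Which object each word most strongly attends to: a crude word-to-object
# assignment that a concept-induction step could build on.
word_to_object = attn.argmax(dim=-1)
print(attn.shape, word_to_object.tolist())
```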
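The abstract describes UnLoc building a feature pyramid whose levels feed heads that predict a per-frame relevancy score and start/end time displacements. A minimal sketch of such a head in PyTorch, with the pyramid replaced by random stand-in features and a single shared head as a simplifying assumption:

```python
# Illustrative sketch of a per-frame localization head: for every frame
# token it outputs a relevancy logit and two non-negative displacements
# (distance to the segment start and end).
import torch
import torch.nn as nn

class PerFrameLocalizationHead(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.relevancy = nn.Linear(dim, 1)      # per-frame relevancy logit
        self.displacement = nn.Linear(dim, 2)   # start/end time offsets

    def forward(self, frame_feats: torch.Tensor):
        # frame_feats: (batch, num_frames, dim) from one pyramid level
        rel = self.relevancy(frame_feats).squeeze(-1)        # (B, T)
        disp = torch.relu(self.displacement(frame_feats))    # (B, T, 2), kept >= 0
        return rel, disp

# Toy usage: three pyramid levels with temporally downsampled features.
head = PerFrameLocalizationHead(dim=256)
pyramid = [torch.randn(2, t, 256) for t in (64, 32, 16)]
for rel, disp in (head(level) for level in pyramid):
    print(rel.shape, disp.shape)
```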
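The abstract reports results from fine-tuning solely on textual embeddings within HiFi Tuner's parameter-efficient framework. A minimal sketch of that general idea, optimizing only a new pseudo-token embedding against a frozen backbone; the toy denoiser and MSE target here stand in for Stable Diffusion's actual denoising objective:

```python
# Illustrative sketch: only the embedding of a new pseudo-token (e.g. "<sks>")
# is trainable; the backbone stays frozen. This is the textual-embedding-only
# setting, not the full HiFi Tuner pipeline.
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim = 32

frozen_denoiser = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, embed_dim))
for p in frozen_denoiser.parameters():
    p.requires_grad_(False)                     # backbone weights are frozen

subject_token = nn.Parameter(torch.randn(embed_dim))      # the only trainable parameter
optimizer = torch.optim.Adam([subject_token], lr=1e-2)

target = torch.randn(embed_dim)                 # stand-in for a denoising target
for step in range(200):
    optimizer.zero_grad()
    pred = frozen_denoiser(subject_token)       # gradients flow back to the token only
    loss = nn.functional.mse_loss(pred, target)
    loss.backward()
    optimizer.step()
print(f"final loss: {loss.item():.4f}")
```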