Generative deep learning: Towards better visual representations and multimodal
Xu, Xingqian
Permalink
https://hdl.handle.net/2142/121491
Description
- Title
- Generative deep learning: Towards better visual representations and multimodal
- Author(s)
- Xu, Xingqian
- Issue Date
- 2023-07-14
- Director of Research (if dissertation) or Advisor (if thesis)
- Shi, Humphrey
- Doctoral Committee Chair(s)
- Shi, Humphrey
- Committee Member(s)
- Hasegawa-Johnson, Mark
- Hwu, Wen-mei
- Wang, Xiaolong
- Department of Study
- Electrical & Computer Engineering
- Discipline
- Electrical & Computer Engineering
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Generative Model
- Representation Learning
- Multimodal Learning
- Computer Vision
- Deep Learning
- Abstract
- Generative AI aims to model certain types of data distributions so that it can generate new data instances mimicking true samples from the underlying distribution. In Computer Vision, generative and discriminative models form the two major categories: the latter aims to accurately predict classes, object locations, segmentations, etc. for a specific data instance, while the former explores and reproduces the complex data manifold. One may argue that generative AI in computer vision needs to be especially capable, since it is intended to simulate real-world data in unrestricted domains of tremendous complexity. Yet even with the most complex network design, formulating the exact data distribution of our natural world is most likely inconceivable, leaving much room for improvement. With the recent boom in Generative AI, researchers and engineers now create high-performing generative solutions that are beginning to meet real-world needs as commercial products, and this thesis takes part in that effort. In this thesis, the author aims to further push generative AI performance by exploring the best possible visual representation forms (i.e., neural implicit embedding, spectral-domain representation, and transformer-based representation) that capture as much visual information as possible. Data representation is a critical premise of Generative AI, as it sets the upper bound of the model's capacity. Moreover, from a broader but less precise angle, the goal of generative modeling, simulating an accurate data distribution, is itself a kind of representation learning. In the final part of this thesis, the author also investigates topics beyond visual representation, toward more general forms of cross-modal representation that fit multiple data modalities, a heuristic step toward an even more challenging quest: General AI. This thesis begins with UltraSR, which explores an implicit neural visual representation well suited to image super-resolution, synthesizing image details at arbitrary upsampling scales. The core idea of UltraSR is to integrate implicit neural representation with learnable periodic encoding, formulating high-frequency visual details as a continuous function. While UltraSR explores neural visual representation, Spectral Hint GAN (SH-GAN) takes a different route, working deeply with visual features in the frequency domain for image completion. SH-GAN proposes a novel spectral network module, the Spectral Hint Unit (SHU), together with two new strategies: Heterogeneous Filtering and Gaussian Split. SH-GAN outperforms prior image completion methods for two reasons: it effectively inpaints low-frequency image structure via a StyleGAN-based co-modulation framework, and it effectively inpaints high-frequency image texture via the SHU. Recent progress in text-to-image (T2I) diffusion models inspires a further work, Prompt-Free Diffusion, in which the CLIP text encoder is substituted with SeeCoder to capture visual cues, removing the need for prompts in the T2I system. SeeCoder automatically distills all sorts of visual cues, including but not limited to semantics, textures, and backgrounds, and passes them on to the diffusion model. The synthesized results are high quality and closely follow the reference visual cues encoded by SeeCoder.
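To make the implicit-representation idea concrete, here is a minimal sketch, assuming a PyTorch setting, of a continuous super-resolution decoder that pairs a coordinate MLP with a learnable periodic encoding in the spirit of UltraSR; the module names, feature sizes, and frequency initialization are illustrative assumptions rather than the thesis implementation.

```python
# Minimal, hypothetical sketch (not the thesis code): an implicit super-resolution
# decoder that combines a latent feature with a learnable periodic coordinate encoding.
import torch
import torch.nn as nn

class PeriodicEncoding(nn.Module):
    """Maps 2-D coordinates to sin/cos features with learnable frequencies."""
    def __init__(self, num_freqs: int = 16):
        super().__init__()
        # Assumption: frequencies initialized log-spaced, then learned end to end.
        self.freqs = nn.Parameter(2.0 ** torch.linspace(0.0, 8.0, num_freqs))

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (N, 2) relative cell coordinates in [-1, 1]
        proj = coords.unsqueeze(-1) * self.freqs           # (N, 2, F)
        enc = torch.cat([proj.sin(), proj.cos()], dim=-1)  # (N, 2, 2F)
        return enc.flatten(1)                              # (N, 4F)

class ImplicitDecoder(nn.Module):
    """Predicts an RGB value from a local latent feature plus an encoded coordinate."""
    def __init__(self, feat_dim: int = 64, num_freqs: int = 16, hidden: int = 256):
        super().__init__()
        self.encode = PeriodicEncoding(num_freqs)
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 4 * num_freqs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, feat: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # feat: (N, feat_dim) features sampled from a low-resolution encoder;
        # coords: (N, 2) query positions at any target scale (x2, x4, or non-integer).
        return self.mlp(torch.cat([feat, self.encode(coords)], dim=1))
```

Because RGB is predicted as a function of continuous coordinates, the same decoder weights can be queried on an upsampling grid of any scale.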
In parallel with Prompt-Free Diffusion, this thesis proposes Versatile Diffusion, the first unified multimodal, multi-flow diffusion pipeline that uniformly handles numerous cross-modal tasks, generating images, text, and variations. Versatile Diffusion has a broader scope: its goal is to combine the representations of different modalities in one generative network, a bold step toward universal generative AI. In conclusion, all of these works provide valuable insights into data representation. UltraSR, SH-GAN, and Prompt-Free Diffusion actively explore the best visual representation under three schemes: implicit neural representation, spectral-domain representation, and transformer-based representation. In the last part, Versatile Diffusion explores unified representation and generation of image, text, and image-text cross-modal data. UltraSR outperforms the baseline model by 0.05 dB on DIV2K across all scales. SH-GAN reaches an FID of 3.41 on FFHQ and 7.10 on Places2, setting a new state of the art in large-scale free-form image completion. Prompt-Free Diffusion and SeeCoder fulfill the popular exemplar-based image generation task with stunning quality. On COCO 2014, Versatile Diffusion reaches CLIP similarities of 0.269 and 0.858 and FIDs of 11.20 and 4.57 on Text-to-Image and Image-Variation respectively, outperforming the baseline Stable Diffusion in all aspects.
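As a rough illustration of the multi-flow idea, the toy block below routes modality-specific data and context layers around one shared core, loosely in the spirit of Versatile Diffusion; the class names, layer choices, and routing keys are assumptions made for exposition, not the actual architecture.

```python
# Toy, hypothetical sketch (not the actual architecture): a "multi-flow" block where
# modality-specific data/context layers are routed around one shared core network.
import torch
import torch.nn as nn

class MultiFlowBlock(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # One data stream per generated modality (image latents vs. text latents).
        self.data_layers = nn.ModuleDict({
            "image": nn.Linear(dim, dim),
            "text": nn.Linear(dim, dim),
        })
        # One context stream per conditioning modality (image vs. text embeddings).
        self.context_layers = nn.ModuleDict({
            "image": nn.Linear(dim, dim),
            "text": nn.Linear(dim, dim),
        })
        # Shared core reused by every cross-modal task.
        self.core = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x, ctx, data_mod: str, ctx_mod: str):
        # Pick the layers that match the requested task, then run the shared core.
        h = self.data_layers[data_mod](x) + self.context_layers[ctx_mod](ctx)
        return self.core(h)

block = MultiFlowBlock()
x, ctx = torch.randn(4, 256), torch.randn(4, 256)
t2i = block(x, ctx, data_mod="image", ctx_mod="text")    # text-to-image flow
var = block(x, ctx, data_mod="image", ctx_mod="image")   # image-variation flow
```

The point of the routing is that a single shared core serves every task, while only the thin modality-specific layers change between text-to-image, image-variation, and image-to-text flows.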
- Graduation Semester
- 2023-08
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2023 Xingqian Xu
Owning Collections
Graduate Dissertations and Theses at Illinois (Primary)