Towards bridging generative and discriminative learning for visual perception
Zheng, Shuhong
Permalink
https://hdl.handle.net/2142/124511
Description
Title
Towards bridging generative and discriminative learning for visual perception
Author(s)
Zheng, Shuhong
Issue Date
2024-04-29
Director of Research (if dissertation) or Advisor (if thesis)
Wang, Yu-Xiong
Department of Study
Computer Science
Discipline
Computer Science
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
M.S.
Degree Level
Thesis
Keyword(s)
Generative Models
Visual Perception
Abstract
Generative models have become prevalent in artificial intelligence (AI), as their impressive generative capability continues to astonish. However, attention has largely focused on their application to synthesizing creative content for entertainment. Taking a different perspective, this thesis explores the possibility of bridging generative and discriminative learning for visual perception tasks.

Our first attempt leverages neural radiance fields (NeRF) for visual perception, bringing 3D awareness to scene understanding. Previous discriminative visual perception models take only a single-view observation as input for visual attribute prediction, overlooking the underlying 3D information within the scene. To overcome this limitation, we propose a novel task called multi-task view synthesis (MTVS), which evaluates comprehensive visual perception of 3D scenes. We also propose a framework called Multi-task and cross-view NeRF (MuvieNeRF), equipped with both multi-task and cross-view reasoning capabilities, to solve this challenging problem. We demonstrate that jointly modeling multi-view information and multi-modal visual attributes benefits performance on the challenging MTVS task.

Because building a NeRF representation requires multi-view observations of a scene, which limits real-world applicability, we take a step further and leverage diffusion models for visual perception. Diffusion models are pretrained on large-scale datasets, which endows them with informative feature representations of visual input and, consequently, a strong ability to solve discriminative tasks. We propose to incorporate this discriminative capability, together with their inherent generative power, into one unified model to better bridge the two perspectives. Our Self-improving UNified Diffusion (SUNDiff) model achieves this by maintaining two sets of parameters within a single model, one for data generation and one for data exploitation. We show that the unified diffusion model design brings performance gains to various types of visual perception tasks and is beneficial across diverse model architectures.

Looking ahead, as generative models continue to evolve rapidly with increasingly impressive generation capability, it is worthwhile to exploit the discriminative potential within these more powerful generative models. Meanwhile, the paradigm of utilizing generative models for discriminative learning can be extended to more diverse tasks and modalities, such as perception and understanding of videos, point clouds, and beyond, providing broader benefit to different subfields of computer vision.
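To make the two-parameter-set idea described for SUNDiff concrete, the following is a minimal, hypothetical PyTorch sketch rather than the thesis's actual implementation; the class, module, and parameter names are assumptions chosen for illustration. It shows a single model that routes shared backbone features either to a generative head (noise prediction, as in denoising diffusion) or to a discriminative head (visual attribute prediction), so data generation and data exploitation coexist in one unified model.

    # Hypothetical sketch, not the SUNDiff code: one model, two parameter sets.
    import torch
    import torch.nn as nn

    class UnifiedDiffusionSketch(nn.Module):
        def __init__(self, feat_dim=256, num_classes=10):
            super().__init__()
            # Shared feature extractor standing in for a pretrained diffusion backbone.
            self.backbone = nn.Sequential(
                nn.Conv2d(3, feat_dim, 3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            # Parameter set 1: generative head (predicts noise for synthesis).
            self.gen_head = nn.Conv2d(feat_dim, 3, 1)
            # Parameter set 2: discriminative head (predicts visual attributes).
            self.disc_head = nn.Linear(feat_dim, num_classes)

        def forward(self, x, mode="generate"):
            feats = self.backbone[0:2](x)              # shared spatial features
            if mode == "generate":
                return self.gen_head(feats)            # generative path
            pooled = self.backbone[2](feats).flatten(1)
            return self.disc_head(pooled)              # discriminative path

    # Usage: the same model instance serves both roles.
    model = UnifiedDiffusionSketch()
    noisy = torch.randn(2, 3, 32, 32)
    eps_pred = model(noisy, mode="generate")           # data generation
    logits = model(noisy, mode="perceive")             # data exploitation

Keeping both heads inside one module is what lets a shared representation serve synthesis and perception at once; the actual parameter-sharing and self-improvement scheme in SUNDiff may differ from this simplified illustration.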