Efficient audio-visual representations for reasoning and synthesis tasks
Chatterjee, Moitreya
Permalink
https://hdl.handle.net/2142/117723
Description
- Title
- Efficient audio-visual representations for reasoning and synthesis tasks
- Author(s)
- Chatterjee, Moitreya
- Issue Date
- 2022-10-28
- Director of Research (if dissertation) or Advisor (if thesis)
- Ahuja, Narendra
- Doctoral Committee Chair(s)
- Ahuja, Narendra
- Committee Member(s)
- Hasegawa-Johnson, Mark A
- Do, Minh N
- Gupta, Saurabh
- Wang, Yuxiong
- Owens, Andrew
- Harwath, David
- Department of Study
- Electrical & Computer Engineering
- Discipline
- Electrical & Computer Engineering
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Audio-Visual
- Scene Understanding
- Multimodal
- Frame Generation
- Audio Source Separation
- Machine Learning
- Computer Vision
- Abstract
- Events that occur in the real world often leave acoustic and visual imprints. Highly evolved organisms, such as human beings, can perceive such events jointly across multiple modalities (such as audio and video). This allows for faster and more accurate perception, with the possibility of supplementing information lacunae in one modality with another. As we seek to enable artificial intelligence (AI) systems to complement humans in their endeavors, it is therefore critical that they be equipped with such multimodal reasoning capabilities as well and be able to undertake tasks that humans usually solve effortlessly. Towards this end, this dissertation explores reasoning and synthesis tasks in the audio-visual space, particularly the tasks of: (a) disambiguating mixed/noisy audio by leveraging video, and (b) predicting a video from audio. This dissertation proposes methods to tackle these challenges while striving to ensure that the proposed solutions remain deployable in resource-constrained environments, where there might be an inadequacy of high-performance computing resources or a paucity of training data. Concretely, this dissertation makes the following four contributions: (i) a novel geometry-aware, sparse, scene-graph-based representation is proposed to undertake audio-source separation given videos of the audio sources in their natural settings; (ii) a multimodal, variational encoder-decoder model, called Sound2Sight, is introduced to coherently synthesize the frames of a video given its audio; (iii) an improved training regime for unimodal and audio-conditioned frame prediction systems, which factors in the predictive uncertainty of the model, is put forth. This adaptation results in such models requiring less data and fewer epochs for training; (iv) finally, a method to compress the dominant image representation tool in multimodal deep neural networks, the Convolutional Neural Network (CNN), is presented so that such networks can operate in environments with lower computational capacity, by exploiting filter activation patterns and inter-filter weight dependencies.
- Graduation Semester
- 2022-12
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2022 Moitreya Chatterjee
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY