Modeling audio and visual cues for real-world event detection
Zhuang, Xiaodan
Permalink
https://hdl.handle.net/2142/24439
Description
- Title
- Modeling audio and visual cues for real-world event detection
- Author(s)
- Zhuang, Xiaodan
- Issue Date
- 2011-05-25
- Director of Research (if dissertation) or Advisor (if thesis)
- Hasegawa-Johnson, Mark A.
- Doctoral Committee Chair(s)
- Hasegawa-Johnson, Mark A.
- Committee Member(s)
- Huang, Thomas S.
- Levinson, Stephen E.
- Downie, J. Stephen
- Department of Study
- Electrical & Computer Engineering
- Discipline
- Electrical & Computer Engineering
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- hidden Markov model
- Gaussian mixture model
- Acoustic Event Detection
- multimedia retrieval
- branch and bound
- Abstract
- Audio-visual event detection aims to identify semantically defined events that reveal human activities. Most previous work has focused on a restricted set of highlight events and relied on highly ad hoc detectors for them. This research emphasizes generalizable, robust modeling of single-microphone audio cues and/or single-camera visual cues for the detection of real-world events, requiring no expensive annotation beyond the known timestamps of the training events.
To model the audio cues, we leverage statistical models proven effective in speech recognition. First, a tandem connectionist-HMM approach combines the sequence-modeling capability of the hidden Markov model (HMM) with the context-dependent discriminative capability of an artificial neural network. Second, an SVM-GMM-supervector approach uses noise-robust kernels to approximate the KL divergence between the feature distributions of different audio segments. The proposed methods outperform our top-ranked HMM-based acoustic event detection system from the CLEAR 2007 Evaluation, which detects twelve general meeting-room events such as keyboard typing, coughing, and chair moving.
To model the visual cues, we propose the Gaussianized vector representation, constructed by adapting a set of Gaussian mixtures to the patch-based descriptors in an image or video clip, regularized by a global Gaussian mixture model. This visual modeling approach establishes unsupervised correspondence between local descriptors in different images or video clips, and achieves outstanding performance on a video event categorization task over ten LSCOM-defined events in the TRECVID broadcast news data, such as exiting car, running, and people marching. Building on an efficient branch-and-bound search scheme, we further propose an object localization approach for the Gaussianized vector representation.
We jointly model audio and visual cues for improved event detection using multi-stream HMMs and coupled HMMs (CHMMs). Spatial pyramid histograms of optical flow are proposed as a generalizable visual representation that requires no training on labeled video data. On a multimedia meeting-room non-speech event detection task, the proposed methods outperform previously reported systems that leverage ad hoc visual object detectors and sound-localization information obtained from multiple microphones.
- Graduation Semester
- 2011-05
- Permalink
- http://hdl.handle.net/2142/24439
- Copyright and License Information
- Copyright 2011 Xiaodan Zhuang
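The Gaussianized vector representation described in the abstract adapts the means of a global Gaussian mixture model toward the patch descriptors of a clip and stacks them into one long "supervector." A minimal NumPy sketch of this idea follows, using standard relevance-MAP mean adaptation with diagonal covariances; the relevance factor `r`, the diagonal-covariance assumption, and the absence of any variance normalization are illustrative assumptions, not the thesis's exact formulation.

```python
import numpy as np

def gaussianize(descriptors, means, covs, weights, r=16.0):
    """MAP-adapt the means of a global GMM to a clip's patch
    descriptors and stack them into a supervector.

    descriptors: (N, D) patch descriptors for one image/clip
    means, covs: (K, D) component means and diagonal variances
    weights:     (K,) mixture weights
    r:           relevance factor (assumed value, not from the thesis)
    """
    K, D = means.shape
    # Log-posterior of each mixture component for each descriptor
    # (diagonal-covariance Gaussians).
    log_post = np.empty((len(descriptors), K))
    for k in range(K):
        diff = descriptors - means[k]
        log_post[:, k] = (np.log(weights[k])
                          - 0.5 * np.sum(np.log(2 * np.pi * covs[k]))
                          - 0.5 * np.sum(diff ** 2 / covs[k], axis=1))
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    # Relevance MAP: interpolate between each component's soft data
    # mean and its global (prior) mean; components that explain little
    # data stay close to the global model.
    n_k = post.sum(axis=0)                            # soft counts, (K,)
    ex_k = post.T @ descriptors                       # soft first moments, (K, D)
    alpha = n_k / (n_k + r)                           # per-component adaptation weight
    adapted = (alpha[:, None] * (ex_k / np.maximum(n_k, 1e-10)[:, None])
               + (1 - alpha)[:, None] * means)
    return adapted.ravel()                            # (K*D,) supervector
```

Because every clip's supervector is indexed by the same global components, entries are comparable across clips without any explicit matching of local descriptors, which is the unsupervised-correspondence property the abstract highlights.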
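The spatial pyramid histograms of optical flow mentioned in the abstract can be sketched as follows: quantize per-pixel flow directions into orientation bins, weight by flow magnitude, and histogram over progressively finer grids. This is a hypothetical illustration; the number of orientation bins, pyramid depth, and per-cell L1 normalization are assumptions rather than the thesis's exact configuration.

```python
import numpy as np

def flow_pyramid_histogram(flow, n_bins=8, levels=2):
    """Spatial pyramid of magnitude-weighted optical-flow
    orientation histograms.

    flow: (H, W, 2) array of per-pixel (dx, dy) motion vectors.
    Returns concatenated per-cell histograms over pyramid levels
    1x1, 2x2, ... up to `levels` grids.
    """
    H, W, _ = flow.shape
    mag = np.hypot(flow[..., 0], flow[..., 1])
    ang = np.arctan2(flow[..., 1], flow[..., 0])      # in (-pi, pi]
    # Map angles to integer orientation bins, clamping pi into the last bin.
    bins = np.minimum((ang + np.pi) / (2 * np.pi) * n_bins,
                      n_bins - 1).astype(int)

    feats = []
    for lvl in range(levels):
        cells = 2 ** lvl                              # 1x1, then 2x2, ...
        for i in range(cells):
            for j in range(cells):
                ys = slice(i * H // cells, (i + 1) * H // cells)
                xs = slice(j * W // cells, (j + 1) * W // cells)
                h = np.bincount(bins[ys, xs].ravel(),
                                weights=mag[ys, xs].ravel(),
                                minlength=n_bins)
                feats.append(h / max(h.sum(), 1e-10))  # L1-normalize each cell
    return np.concatenate(feats)
```

Because the representation depends only on raw motion statistics, it needs no labeled video for training, which is the generalizability property claimed in the abstract.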
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)
Dissertations and Theses - Electrical and Computer Engineering