Modeling audio and visual cues for real-world event detection
Zhuang, Xiaodan
Permalink
https://hdl.handle.net/2142/24439
Description
- Title
- Modeling audio and visual cues for real-world event detection
- Author(s)
- Zhuang, Xiaodan
- Issue Date
- 2011-05-25
- Director of Research (if dissertation) or Advisor (if thesis)
- Hasegawa-Johnson, Mark A.
- Doctoral Committee Chair(s)
- Hasegawa-Johnson, Mark A.
- Committee Member(s)
- Huang, Thomas S.
- Levinson, Stephen E.
- Downie, J. Stephen
- Department of Study
- Electrical & Computer Engineering
- Discipline
- Electrical & Computer Engineering
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- hidden Markov model
- Gaussian mixture model
- Acoustic Event Detection
- multimedia retrieval
- branch and bound
- Abstract
- Audio-visual event detection aims to identify semantically defined events that reveal human activities. Most previous work has focused on a restricted set of highlight events and relied on highly ad hoc detectors for them. This research emphasizes generalizable, robust modeling of single-microphone audio cues and/or single-camera visual cues for the detection of real-world events, requiring no expensive annotation beyond the known timestamps of the training events.
To model the audio cues, we leverage statistical models proven effective in speech recognition. First, a tandem connectionist-HMM approach combines the sequence-modeling capability of the hidden Markov model (HMM) with the context-dependent discriminative capability of an artificial neural network. Second, an SVM-GMM-supervector approach uses noise-robust kernels to approximate the KL divergence between the feature distributions of different audio segments. The proposed methods outperform our top-ranked HMM-based acoustic event detection system from the CLEAR 2007 Evaluation, which detects twelve general meeting-room events such as keyboard typing, coughing, and chair moving.
To model the visual cues, we propose the Gaussianized vector representation, constructed by adapting a set of Gaussian mixtures to the patch-based descriptors in an image or video clip, regularized by a global Gaussian mixture model. This visual modeling approach establishes unsupervised correspondence between local descriptors in different images or video clips, and achieves outstanding performance on a video event categorization task over ten LSCOM-defined events in the TRECVID broadcast news data, such as exiting car, running, and people marching. Building on an efficient branch-and-bound search scheme, we further propose an object localization approach for the Gaussianized vector representation.
We jointly model audio and visual cues for improved event detection using multi-stream HMMs and coupled HMMs (CHMMs). Spatial pyramid histograms of optical flow are proposed as a generalizable visual representation that requires no training on labeled video data. On a multimedia meeting-room non-speech event detection task, the proposed methods outperform previously reported systems that leverage ad hoc visual object detectors and sound-localization information obtained from multiple microphones.
- Graduation Semester
- 2011-05
- Permalink
- http://hdl.handle.net/2142/24439
- Copyright and License Information
- Copyright 2011 Xiaodan Zhuang
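The Gaussianized vector representation described in the abstract adapts the means of a global Gaussian mixture model toward the patch descriptors of a clip and stacks them into one long "supervector." A minimal NumPy sketch of this idea follows, using standard relevance-MAP mean adaptation with diagonal covariances; the relevance factor `r`, the diagonal-covariance assumption, and the absence of any variance normalization are illustrative assumptions, not the thesis's exact formulation.

```python
import numpy as np

def gaussianize(descriptors, means, covs, weights, r=16.0):
    """MAP-adapt the means of a global GMM to a clip's patch
    descriptors and stack them into a supervector.

    descriptors: (N, D) patch descriptors for one image/clip
    means, covs: (K, D) component means and diagonal variances
    weights:     (K,) mixture weights
    r:           relevance factor (assumed value, not from the thesis)
    """
    K, D = means.shape
    # Log-posterior of each mixture component for each descriptor
    # (diagonal-covariance Gaussians).
    log_post = np.empty((len(descriptors), K))
    for k in range(K):
        diff = descriptors - means[k]
        log_post[:, k] = (np.log(weights[k])
                          - 0.5 * np.sum(np.log(2 * np.pi * covs[k]))
                          - 0.5 * np.sum(diff ** 2 / covs[k], axis=1))
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    # Relevance MAP: interpolate between each component's soft data
    # mean and its global (prior) mean; components that explain little
    # data stay close to the global model.
    n_k = post.sum(axis=0)                            # soft counts, (K,)
    ex_k = post.T @ descriptors                       # soft first moments, (K, D)
    alpha = n_k / (n_k + r)                           # per-component adaptation weight
    adapted = (alpha[:, None] * (ex_k / np.maximum(n_k, 1e-10)[:, None])
               + (1 - alpha)[:, None] * means)
    return adapted.ravel()                            # (K*D,) supervector
```

Because every clip's supervector is indexed by the same global components, entries are comparable across clips without any explicit matching of local descriptors, which is the unsupervised-correspondence property the abstract highlights.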
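The spatial pyramid histograms of optical flow mentioned in the abstract can be sketched as follows: quantize per-pixel flow directions into orientation bins, weight by flow magnitude, and histogram over progressively finer grids. This is a hypothetical illustration; the number of orientation bins, pyramid depth, and per-cell L1 normalization are assumptions rather than the thesis's exact configuration.

```python
import numpy as np

def flow_pyramid_histogram(flow, n_bins=8, levels=2):
    """Spatial pyramid of magnitude-weighted optical-flow
    orientation histograms.

    flow: (H, W, 2) array of per-pixel (dx, dy) motion vectors.
    Returns concatenated per-cell histograms over pyramid levels
    1x1, 2x2, ... up to `levels` grids.
    """
    H, W, _ = flow.shape
    mag = np.hypot(flow[..., 0], flow[..., 1])
    ang = np.arctan2(flow[..., 1], flow[..., 0])      # in (-pi, pi]
    # Map angles to integer orientation bins, clamping pi into the last bin.
    bins = np.minimum((ang + np.pi) / (2 * np.pi) * n_bins,
                      n_bins - 1).astype(int)

    feats = []
    for lvl in range(levels):
        cells = 2 ** lvl                              # 1x1, then 2x2, ...
        for i in range(cells):
            for j in range(cells):
                ys = slice(i * H // cells, (i + 1) * H // cells)
                xs = slice(j * W // cells, (j + 1) * W // cells)
                h = np.bincount(bins[ys, xs].ravel(),
                                weights=mag[ys, xs].ravel(),
                                minlength=n_bins)
                feats.append(h / max(h.sum(), 1e-10))  # L1-normalize each cell
    return np.concatenate(feats)
```

Because the representation depends only on raw motion statistics, it needs no labeled video for training, which is the generalizability property claimed in the abstract.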
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)
Dissertations and Theses - Electrical and Computer Engineering