Statistical Model Based Multi-Microphone Speech Processing: Toward Overcoming Mismatch Problem

Kim, Lae-Hoon

Permalink

https://hdl.handle.net/2142/16839

Description

Title

Statistical Model Based Multi-Microphone Speech Processing: Toward Overcoming Mismatch Problem

Author(s)

Kim, Lae-Hoon

Issue Date

2010-08-20T17:59:28Z

Director of Research (if dissertation) or Advisor (if thesis)

Hasegawa-Johnson, Mark A.

Doctoral Committee Chair(s)

Hasegawa-Johnson, Mark A.

Committee Member(s)

Levinson, Stephen E.
Do, Minh N.
Fleck, Margaret M.

Department of Study

Electrical and Computer Engineering

Discipline

Electrical and Computer Engineering

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Keyword(s)

Independent component analysis
beamforming
Expectation maximization beamforming (EMB)
robust automatic speech recognition
missing feature

Abstract

This thesis describes a jointly optimal method for clean speech estimation and automatic speech recognition (ASR) in a mismatched condition, using a unified speech model under a generalized expectation maximization (GEM) scheme. From this perspective, multi-microphone optimal speech estimation can be interpreted as pre-processing that increases the reliability of feature components before the actual speech recognition or model-based speech estimation is performed. Likewise, ideal binary mask (IBM) estimation in the context of the statistical model for ASR can be regarded as an initialization step that excludes the unreliable portion from ASR and increases estimation accuracy by relying only on the reliable components and the trained speech process model.

Optimal multi-microphone speech processing is performed in the short-time Fourier transform (STFT) domain, since the atomic speech information can be meaningfully represented with a series of short frames of 10 to 30 ms. Convolution in the time domain is formulated as filtering via a feed-forward network in the STFT domain, and is shown to be an appropriate representation under the overlap-add framework. With this structure in mind, sufficient statistics for estimating target speech from the multi-microphone measurements are formulated, and realistic relaxations of them are discussed, since not only the target speech information but also the room impulse responses (RIRs) must be estimated, and the RIRs have unavoidable uncertainty due to the movement of speakers.

First, reverberant speech mixture separation with typical background noise is tackled. Standard adaptive independent component analysis (ICA) implemented with the natural gradient method is extended into the STFT domain with regularized feed-forward ICA (RFFICA) and post-processing based on direction-per-frequency. This method showed up to almost an order of magnitude performance improvement (29 dB in C-weighting) compared with state-of-the-art methods.

Second, the filters must be updated quickly, using a smaller amount of measured data that shares the same directional information about the target and interference locations. Expectation maximization beamforming (EMB) followed by minimum mean squared error (MMSE) post-filtering is proposed to reduce the number of filter taps to update. Because EMB provides generative-model-based information about the target speech presence probability for each frequency bin and each frame, together with more robust direction-of-arrival (DOA) estimation, it can also replace the direction-per-frequency post-processing that was applied independently after RFFICA.

Third, the DOA-only beamforming is extended to early-response-based beamforming. The RIRs from the target and interference speech are estimated given the robust DOA estimates and are used to construct a linearly constrained minimum variance (LCMV) beamformer, which can be easily extended with the EMB framework. Because this is a two-step approach, estimating the RIRs first and then applying a demixing filter, without introducing more taps in the frame for adaptation purposes, it gives good demixing or dereverberation results.

Finally, IBM estimation and ASR are jointly formulated under a GEM framework. Even with optimal front-end pre-processing, there always remains a portion of the signal that is mismatched with the statistical speech process model used for ASR. Identifying the corrupted portions and removing them, from the perspective of ASR itself, is therefore a necessary procedure. The cepstral-domain ASR models are transformed into the spectral domain without loss of information through the global tying process. The proposed algorithm achieved much higher absolute ASR accuracy, ranging from 14.69% at 0 dB signal-to-noise ratio (SNR) to 40.10% at 15 dB SNR, than a normal ASR method with optimal front-end processing in a highly non-stationary mismatch environment.
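
As a rough illustration of the LCMV beamforming mentioned in the abstract, the sketch below computes the textbook closed-form LCMV weights for a single frequency bin. It is not the thesis's implementation: the thesis builds its constraints from estimated RIRs (early responses), whereas this sketch assumes idealized far-field steering vectors for a uniform linear array. The function names, array geometry, angles, frequency, and covariance model are all hypothetical choices made for the example.

```python
import numpy as np

def steering_vector(n_mics, spacing_m, angle_rad, freq_hz, c=343.0):
    """Far-field steering vector for a uniform linear array (assumed geometry)."""
    positions = np.arange(n_mics) * spacing_m
    delays = positions * np.sin(angle_rad) / c
    return np.exp(-2j * np.pi * freq_hz * delays)

def lcmv_weights(R, C, f):
    """Textbook LCMV solution w = R^{-1} C (C^H R^{-1} C)^{-1} f.

    R: (M, M) Hermitian positive-definite spatial covariance at one frequency bin.
    C: (M, K) constraint matrix, one column per constrained source.
    f: (K,)   desired response for each constrained source.
    """
    Rinv_C = np.linalg.solve(R, C)        # R^{-1} C without forming the inverse
    gram = C.conj().T @ Rinv_C            # C^H R^{-1} C, shape (K, K)
    return Rinv_C @ np.linalg.solve(gram, f)

# Hypothetical scenario: 4 microphones, 5 cm spacing, one bin at 1 kHz.
n_mics, spacing, freq = 4, 0.05, 1000.0
d_target = steering_vector(n_mics, spacing, np.deg2rad(0.0), freq)
d_interf = steering_vector(n_mics, spacing, np.deg2rad(40.0), freq)

# Constraints: unit gain toward the target, a null toward the interferer.
C = np.stack([d_target, d_interf], axis=1)
f = np.array([1.0, 0.0], dtype=complex)

# Assumed covariance: target plus a stronger interferer plus sensor noise.
R = (np.outer(d_target, d_target.conj())
     + 4.0 * np.outer(d_interf, d_interf.conj())
     + 0.01 * np.eye(n_mics))

w = lcmv_weights(R, C, f)
response = C.conj().T @ w                 # achieved response per constraint
```

With these inputs, `response` should match `f` up to numerical precision, meaning the beamformer passes the target direction undistorted while nulling the interferer. Replacing the columns of `C` with estimated early-response transfer functions would be closer in spirit to the approach the abstract describes.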

Graduation Semester

2010-08

Owning Collections

Graduate Dissertations and Theses at Illinois (primary)

Dissertations and Theses - Electrical and Computer Engineering
