Statistical Model Based Multi-Microphone Speech Processing: Toward Overcoming Mismatch Problem
Kim, Lae-Hoon
Permalink
https://hdl.handle.net/2142/16839
Description
- Title
- Statistical Model Based Multi-Microphone Speech Processing: Toward Overcoming Mismatch Problem
- Author(s)
- Kim, Lae-Hoon
- Issue Date
- 2010-08-20T17:59:28Z
- Director of Research (if dissertation) or Advisor (if thesis)
- Hasegawa-Johnson, Mark A.
- Doctoral Committee Chair(s)
- Hasegawa-Johnson, Mark A.
- Committee Member(s)
- Levinson, Stephen E.
- Do, Minh N.
- Fleck, Margaret M.
- Department of Study
- Electrical and Computer Engineering
- Discipline
- Electrical and Computer Engineering
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Independent component analysis
- beamforming
- Expectation maximization beamforming (EMB)
- robust automatic speech recognition
- missing feature
- Abstract
- In this thesis, a jointly optimal method for clean speech estimation and automatic speech recognition (ASR) in a mismatched condition is described with a unified speech model under a generalized expectation maximization (GEM) scheme. From this perspective, multi-microphone optimal speech estimation can be interpreted as pre-processing to increase the reliability of feature components before the actual speech recognition or model-based speech estimation is performed. Also, ideal binary mask (IBM) estimation in the context of the statistical model for ASR can be regarded as an initialization step that excludes the unreliable portion for ASR and increases the estimation accuracy based only on the reliable components and the trained speech process model. Optimal multi-microphone speech processing is performed in the short-time Fourier transform (STFT) domain, since the atomic speech information can be meaningfully represented with a series of short frames of 10 to 30 ms. Convolution in the time domain is formulated as filtering via a feed-forward network in the STFT domain, and is shown to be an appropriate representation under the overlap-add framework. With this structure in mind, sufficient statistics for estimating target speech from the multi-microphone measurements are formulated, and realistic relaxations for them are discussed, since we need to estimate not only the target speech information but also the room impulse responses (RIRs), which have unavoidable uncertainty due to the movement of speakers. Firstly, reverberant speech mixture separation with typical background noise is tackled. Standard adaptive independent component analysis (ICA) implemented with the natural gradient method is extended into the STFT domain with regularized feed-forward ICA (RFFICA) and post-processing based on direction-per-frequency. This method showed up to almost an order of magnitude performance improvement (29 dB in C-weighting) compared with state-of-the-art methods.
Secondly, we aim to update the filters quickly enough using a smaller amount of measured data that shares the same directional information about the target and interference locations. Expectation maximization beamforming (EMB) followed by minimum mean squared error (MMSE) post-filtering is proposed to reduce the number of filter taps to update. Because we can obtain generative-model-based information about the target speech presence probability for each frequency bin and each frame, with enhanced robust direction-of-arrival (DOA) estimation capability, EMB can also be used to replace the direction-per-frequency based post-processing, which has been applied independently after RFFICA. Thirdly, DOA-only beamforming is extended to early-response-based beamforming. We estimate the RIRs of the target and interference speech given robust DOA estimates and construct a linearly constrained minimum variance (LCMV) beamformer, which can be easily extended with the EMB framework. Because we take a two-step approach, estimating the RIRs first and then applying a demixing filter, without introducing more taps in the frame for adaptation purposes, we can obtain good demixing or dereverberation results. Finally, IBM estimation and ASR are jointly formulated under a GEM framework. Even with optimal front-end pre-processing, there always exists a portion that is mismatched with the statistical speech process model used for ASR. Therefore, identifying the corrupted portions and removing them from the perspective of ASR itself is a necessary procedure. The cepstral domain ASR models are transformed into the spectral domain without loss of information through the global tying process. The proposed algorithm achieved much higher absolute ASR accuracy, ranging from 14.69% at 0 dB signal-to-noise ratio (SNR) to 40.10% at 15 dB SNR, than a conventional ASR method with optimal front-end processing in a highly non-stationary mismatch environment.
- Graduation Semester
- 2010-08
- Permalink
- http://hdl.handle.net/2142/16839
- Copyright and License Information
- Copyright 2010 Lae-Hoon Kim
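As a rough illustration of the ideal binary mask (IBM) mentioned in the abstract, the sketch below computes an oracle mask in the STFT domain under one common definition: a time-frequency cell is marked reliable (1) when the local target-to-noise ratio exceeds a threshold. It assumes the clean target and the noise are available separately, which is not the case in the thesis, where the mask is estimated jointly with ASR under a GEM framework. The function names (`stft`, `ideal_binary_mask`), the 20 ms frame at 16 kHz, the Hann window, the 0 dB threshold, and the test signals are all assumptions chosen for the example and are not taken from the dissertation.

```python
import numpy as np

def stft(x, frame_len=320, hop=160):
    """Short-time Fourier transform with a Hann window.

    Returns an array of shape (n_frames, frame_len // 2 + 1).
    frame_len=320 is 20 ms at 16 kHz (an assumed setting).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def ideal_binary_mask(target, noise, lc_db=0.0, **stft_kwargs):
    """Oracle mask: 1 where the local target-to-noise ratio exceeds lc_db."""
    target_power = np.abs(stft(target, **stft_kwargs)) ** 2
    noise_power = np.abs(stft(noise, **stft_kwargs)) ** 2
    eps = 1e-12  # avoids division by zero and log of zero
    local_snr_db = 10.0 * np.log10((target_power + eps) / (noise_power + eps))
    return (local_snr_db > lc_db).astype(np.uint8)

# Toy signals (assumptions): a 1 kHz tone stands in for speech.
fs = 16000
t = np.arange(fs) / fs
rng = np.random.default_rng(0)
target = np.sin(2 * np.pi * 1000 * t)
noise = 0.1 * rng.standard_normal(fs)

mask = ideal_binary_mask(target, noise)
print(mask.shape)  # (frames, frequency bins)
```

With these settings the bins are 50 Hz apart, so the tone falls in bin 20 and that column of the mask is marked reliable, while bins far from the tone are dominated by the noise and are masked out.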
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)
Dissertations and Theses - Electrical and Computer Engineering