Multimodal LSTM for audio-visual speech recognition
Xu, Yijia
Permalink
https://hdl.handle.net/2142/97896
Description
Title
Multimodal LSTM for audio-visual speech recognition
Author(s)
Xu, Yijia
Contributor(s)
Hasegawa-Johnson, Mark
Issue Date
2017-05
Keyword(s)
audio-visual speech recognition
speech recognition
long short-term memory
connectionist temporal classification
multi-layer perceptron
multimodal fusion
AVICAR
deep neural network
phoneme recognition
Abstract
Automatic speech recognition (ASR) permits effective interaction between humans and machines in environments where typing is impossible. Some environments, however, are more difficult than others: acoustic noise disrupts ASR. This research focuses on audio-visual speech recognition (AVSR), which improves noise robustness during speech recognition with the aid of visual speech information from the speaker's mouth region. The research includes a lip-tracking system and a system for extracting effective audio and visual features for building an audio-visual speech recognition system. A context-independent phoneme dictionary is also built to map 3896 triphone states (trained by Intel on Intel data) to the corresponding 42 phoneme labels. Two methods for audio-visual speech recognition are proposed and compared. The first method upsamples the visual frames to force-align them with the audio frames and the context-independent phoneme labels. Unimodal deep LSTM networks are trained separately for the audio and visual streams on the AVICAR dataset, and their posteriors are fused to obtain multimodal speech recognition. The second method uses the Connectionist Temporal Classification (CTC) objective function for the LSTM. It does not require strict alignment between the audio-visual frames and the target labels: it automatically labels the unsegmented sequences of audio and visual data and trains the classification network with a criterion based on this automatic alignment, which is revised at every training iteration. The trained network is then used to perform audio-visual phoneme recognition. Results include an overall accuracy of 48.91% for audio-only phoneme recognition and 38.57% for visual-only phoneme recognition; the latter outperforms the traditional deep neural network's phoneme recognition accuracy of 24.39%. Audio-visual phoneme recognition achieves 0.04% higher accuracy than audio-only recognition. The CTC loss function turns speech recognition into a convenient end-to-end process and achieves relatively high recognition accuracy (72.97%) on classification tasks with a small number of classes, using best-path decoding.
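To make the first method in the abstract concrete, below is a minimal Python sketch of late (posterior) fusion, assuming a PyTorch-style implementation. The layer sizes, feature dimensions, and fusion weight are illustrative assumptions, not values taken from the thesis.

# Hypothetical late-fusion sketch: two unimodal BiLSTMs produce per-frame
# phoneme log-posteriors, which are combined by a weighted average before
# the per-frame argmax decision.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_PHONEMES = 42  # context-independent phoneme set size from the abstract

class UnimodalLSTM(nn.Module):
    """A single-modality BiLSTM mapping a feature sequence to per-frame
    phoneme log-posteriors."""

    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, NUM_PHONEMES)

    def forward(self, x):
        # x: (batch, frames, feat_dim) -> (batch, frames, NUM_PHONEMES)
        out, _ = self.lstm(x)
        return F.log_softmax(self.proj(out), dim=-1)

def fuse_posteriors(audio_logp, visual_logp, audio_weight=0.7):
    # Weighted combination of frame-level log-posteriors; assumes the visual
    # stream has already been upsampled so both tensors share one frame rate.
    return audio_weight * audio_logp + (1.0 - audio_weight) * visual_logp

audio_net = UnimodalLSTM(feat_dim=39)    # e.g. MFCC-style features (assumed)
visual_net = UnimodalLSTM(feat_dim=50)   # e.g. mouth-region features (assumed)
audio_feats = torch.randn(1, 200, 39)    # 200 audio frames
visual_feats = torch.randn(1, 200, 50)   # visual frames upsampled to 200
fused = fuse_posteriors(audio_net(audio_feats), visual_net(visual_feats))
frame_phonemes = fused.argmax(dim=-1)    # per-frame phoneme decisions

A similarly hedged sketch of the second method's training criterion scores an LSTM's per-frame log-posteriors against an unaligned phoneme label sequence with the CTC loss, removing the need for frame-level alignment. Sequence lengths and the blank index here are assumptions.

# Hypothetical CTC training step: per-frame log-posteriors (in practice the
# output of an LSTM, with one extra "blank" class) scored against an
# unaligned phoneme label sequence.
import torch
import torch.nn as nn

NUM_CLASSES = 42 + 1                      # 42 phonemes plus the CTC blank
ctc = nn.CTCLoss(blank=0)                 # index 0 reserved for the blank

log_probs = torch.randn(200, 1, NUM_CLASSES,       # (frames, batch, classes)
                        requires_grad=True).log_softmax(-1)
targets = torch.randint(1, NUM_CLASSES, (1, 30))   # 30 unaligned phoneme labels
loss = ctc(log_probs, targets, torch.tensor([200]), torch.tensor([30]))
loss.backward()                           # gradients for end-to-end training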