Multimodal LSTM for audio-visual speech recognition
Xu, Yijia
Permalink
https://hdl.handle.net/2142/97896
Description
Title
Multimodal LSTM for audio-visual speech recognition
Author(s)
Xu, Yijia
Contributor(s)
Hasegawa-Johnson, Mark
Issue Date
2017-05
Keyword(s)
audio-visual speech recognition
speech recognition
long short-term memory
connectionist temporal classification
multi-layer perceptron
multimodal fusion
AVICAR
deep neural network
phoneme recognition
Abstract
Automatic speech recognition (ASR) permits effective interaction between humans and machines in environments where typing is impossible. Some environments, however, are more difficult than others: acoustic noise disrupts ASR. This research focuses on audio-visual speech recognition (AVSR), which improves noise robustness during speech recognition with the aid of visual speech information from the speaker's mouth region. The research includes a lip-tracking system and a system for extracting effective audio and visual features for building an audio-visual speech recognition system. A context-independent phoneme dictionary is also built to map 3896 triphone states (trained by Intel on Intel data) to the corresponding 42 phoneme labels. Two methods for audio-visual speech recognition are proposed and compared. The first method upsamples the visual frames to force-align them with the audio frames and the context-independent phoneme labels. Unimodal deep LSTM networks are trained separately for the audio and visual streams on the AVICAR dataset, and their posteriors are fused to obtain multimodal speech recognition. The second method uses the Connectionist Temporal Classification (CTC) objective function for the LSTM. It does not require strict alignment between the audio-visual frames and the target labels: it automatically labels the unsegmented sequences of audio and visual data and trains the classification network with a criterion based on this automatic alignment, which is revised at every training iteration. The trained network is then used to perform audio-visual phoneme recognition. Results include an overall accuracy of 48.91% for audio-only phoneme recognition and 38.57% for visual-only phoneme recognition; the latter outperforms the traditional deep neural network's phoneme recognition accuracy of 24.39%. Audio-visual phoneme recognition achieves 0.04% higher accuracy than audio-only recognition. The CTC loss function turns speech recognition into a convenient end-to-end process and achieves relatively high recognition accuracy (72.97%) on classification tasks with a small number of classes, using best-path decoding.
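To make the first method in the abstract concrete, below is a minimal Python sketch of late (posterior) fusion, assuming a PyTorch-style implementation. The layer sizes, feature dimensions, and fusion weight are illustrative assumptions, not values taken from the thesis.

# Hypothetical late-fusion sketch: two unimodal BiLSTMs produce per-frame
# phoneme log-posteriors, which are combined by a weighted average before
# the per-frame argmax decision.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_PHONEMES = 42  # context-independent phoneme set size from the abstract

class UnimodalLSTM(nn.Module):
    """A single-modality BiLSTM mapping a feature sequence to per-frame
    phoneme log-posteriors."""

    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, NUM_PHONEMES)

    def forward(self, x):
        # x: (batch, frames, feat_dim) -> (batch, frames, NUM_PHONEMES)
        out, _ = self.lstm(x)
        return F.log_softmax(self.proj(out), dim=-1)

def fuse_posteriors(audio_logp, visual_logp, audio_weight=0.7):
    # Weighted combination of frame-level log-posteriors; assumes the visual
    # stream has already been upsampled so both tensors share one frame rate.
    return audio_weight * audio_logp + (1.0 - audio_weight) * visual_logp

audio_net = UnimodalLSTM(feat_dim=39)    # e.g. MFCC-style features (assumed)
visual_net = UnimodalLSTM(feat_dim=50)   # e.g. mouth-region features (assumed)
audio_feats = torch.randn(1, 200, 39)    # 200 audio frames
visual_feats = torch.randn(1, 200, 50)   # visual frames upsampled to 200
fused = fuse_posteriors(audio_net(audio_feats), visual_net(visual_feats))
frame_phonemes = fused.argmax(dim=-1)    # per-frame phoneme decisions

A similarly hedged sketch of the second method's training criterion scores an LSTM's per-frame log-posteriors against an unaligned phoneme label sequence with the CTC loss, removing the need for frame-level alignment. Sequence lengths and the blank index here are assumptions.

# Hypothetical CTC training step: per-frame log-posteriors (in practice the
# output of an LSTM, with one extra "blank" class) scored against an
# unaligned phoneme label sequence.
import torch
import torch.nn as nn

NUM_CLASSES = 42 + 1                      # 42 phonemes plus the CTC blank
ctc = nn.CTCLoss(blank=0)                 # index 0 reserved for the blank

log_probs = torch.randn(200, 1, NUM_CLASSES,       # (frames, batch, classes)
                        requires_grad=True).log_softmax(-1)
targets = torch.randint(1, NUM_CLASSES, (1, 30))   # 30 unaligned phoneme labels
loss = ctc(log_probs, targets, torch.tensor([200]), torch.tensor([30]))
loss.backward()                           # gradients for end-to-end training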