Explainable artificial intelligence for inclusive automatic speech recognition
Lee, Seunghyun
Permalink
https://hdl.handle.net/2142/121386
Description
Title
Explainable artificial intelligence for inclusive automatic speech recognition
Author(s)
Lee, Seunghyun
Issue Date
2023-07-21
Director of Research (if dissertation) or Advisor (if thesis)
Hasegawa-Johnson, Mark A
Department of Study
Electrical & Computer Engineering
Discipline
Electrical & Computer Engineering
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
M.S.
Degree Level
Thesis
Keyword(s)
Inclusive ASR
Explainable AI
ASR Visualization
Abstract
While the widespread adoption of automatic speech recognition (ASR) technology has brought significant benefits to society, it has also highlighted a persistent inequality in access to and utilization of that technology. Furthermore, in response to the increasing prevalence of artificial intelligence applications, there has been a growing demand for explainable artificial intelligence (XAI). To address the need for interpretability and explainability in ASR, particularly in the context of inclusiveness, this thesis visualizes the inner workings of the convolutional neural network (CNN) layer and the Transformer block in Wav2Vec2.0. This is achieved by calculating the weighted relevance of the connectionist temporal classification (CTC) output with respect to the attention and convolutional layers. Leveraging a Wav2Vec2.0 model pre-trained and fine-tuned on LibriSpeech, and testing it on the Speech Accent Archive, we discovered that the Transformer attends to other vowel transcriptions when encountering vowels within a word, whereas its attention is more localized when transcribing consonants or vowels in non-words absent from its learned vocabulary. Analysis of the weighted convolutional relevance in the first CNN layer revealed that different channels concentrate on distinct frequency and time regions to capture the overall characteristics of the input. By developing a comprehensive understanding of the underlying causes and dynamics behind performance disparities, we can work to mitigate them and promote a more inclusive ASR technology.
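The following is a minimal sketch, not the thesis's exact relevance computation, of how gradient-weighted attention maps can be extracted from a pre-trained Wav2Vec2.0 CTC model using the HuggingFace Transformers library. The checkpoint name (facebook/wav2vec2-base-960h), the input file sample.wav, and the gradient-times-attention weighting are illustrative assumptions and may differ from the weighted-relevance formulation used in the thesis.

```python
# Sketch: gradient-weighted attention relevance for a Wav2Vec2.0 CTC model.
# Assumptions: HuggingFace checkpoint "facebook/wav2vec2-base-960h" and a
# 16 kHz mono file "sample.wav"; the weighting scheme (attention x gradient)
# is a generic illustration, not the thesis's exact method.
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

waveform, sr = torchaudio.load("sample.wav")  # expected: 16 kHz mono audio
inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000,
                   return_tensors="pt")

# Forward pass, asking the model to return per-layer attention weights.
outputs = model(inputs.input_values, output_attentions=True)
logits = outputs.logits                 # (1, frames, vocab)
attentions = outputs.attentions         # tuple of (1, heads, frames, frames)

# Keep gradients on the (non-leaf) attention tensors so backward() stores them.
for attn in attentions:
    attn.retain_grad()

# Back-propagate the score of the CTC-predicted token at every frame.
pred_ids = logits.argmax(dim=-1)                      # greedy CTC predictions
score = logits.gather(-1, pred_ids.unsqueeze(-1)).sum()
score.backward()

# Gradient-weighted attention relevance per layer, averaged over heads.
relevance = [(a * a.grad).clamp(min=0).mean(dim=1).squeeze(0)  # (frames, frames)
             for a in attentions]
```

The resulting per-layer relevance maps can be rendered as frame-by-frame heat maps alongside the CTC transcription to inspect which input frames the Transformer attends to when emitting each character.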