Learning shared semantic space for speech-to-text translation
Han, Chi
This item is only available for download by members of the University of Illinois community.
Permalink
https://hdl.handle.net/2142/120363
Description
Title
Learning shared semantic space for speech-to-text translation
Author(s)
Han, Chi
Issue Date
2023-04-13
Advisor
Ji, Heng
Department of Study
Computer Science
Discipline
Computer Science
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
M.S.
Degree Level
Thesis
Keyword(s)
Speech-to-Text Translation
Natural Language Processing
Representation Learning
Abstract
End-to-end speech translation (ST) has far-reaching implications and numerous potential applications, making it an area of significant interest and impact. Despite its importance, ST has traditionally been treated as a separate task, failing to fully leverage the rapid advancements in its closely related sibling, text machine translation (MT). This separation is due to the modality gap, which results from the different representations of text and audio inputs, rendering MT data and end-to-end models incompatible with their ST counterparts. In light of this challenge, we present Chimera, a novel approach designed to bridge the representation gap between these two modalities. Chimera achieves this by projecting audio and text features onto a common semantic representation, effectively unifying the MT and ST tasks. Consequently, Chimera enhances the performance on ST benchmarks, such as MuST-C and Augmented Librispeech, setting new state-of-the-art results. More specifically, Chimera attains a 27.1 BLEU score on the MuST-C EN-DE benchmark, improving the existing state-of-the-art by a substantial margin of +1.9 BLEU. Further experimental analyses substantiate that the shared semantic space indeed facilitates the exchange of common knowledge between the MT and ST tasks. We discovered identifiable semantic regions within the shared joint speech-text encoding space, highlighting the effective integration of both modalities. By plotting neural activation maps between parallel speech and text, we were able to visualize the convergence of semantic information, further demonstrating the success of our approach in bridging the modality gap and fostering a more robust understanding of the underlying linguistic structures. This finding paves the way for augmenting training resources across modalities and opens up new avenues for exploration in the field of speech translation.
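The core idea described above, projecting audio and text encoder outputs into one shared semantic space so that paired inputs align, can be illustrated with a minimal sketch. This is not the thesis's actual Chimera architecture; all module names, dimensions, and the InfoNCE-style contrastive objective here are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSemanticProjector(nn.Module):
    """Illustrative sketch: map speech and text features into a common space.

    Hypothetical layer names and sizes; the real model's encoders and
    projection scheme are described in the thesis itself.
    """
    def __init__(self, speech_dim=512, text_dim=512, shared_dim=256):
        super().__init__()
        self.speech_proj = nn.Linear(speech_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)

    def forward(self, speech_feats, text_feats):
        # Mean-pool variable-length sequences into one vector per utterance,
        # then L2-normalize so similarity is a cosine score.
        s = F.normalize(self.speech_proj(speech_feats).mean(dim=1), dim=-1)
        t = F.normalize(self.text_proj(text_feats).mean(dim=1), dim=-1)
        return s, t

def alignment_loss(s, t, temperature=0.07):
    # InfoNCE-style contrastive loss: each speech embedding should be
    # closest to its own transcript's text embedding within the batch.
    logits = s @ t.T / temperature
    labels = torch.arange(s.size(0))
    return F.cross_entropy(logits, labels)

# Toy usage with random features standing in for encoder outputs.
torch.manual_seed(0)
model = SharedSemanticProjector()
speech = torch.randn(4, 100, 512)  # (batch, audio frames, feature dim)
text = torch.randn(4, 20, 512)     # (batch, tokens, feature dim)
s, t = model(speech, text)
loss = alignment_loss(s, t)
```

Minimizing such an alignment objective is one common way to pull the two modalities onto a shared manifold, after which a single translation decoder can consume either modality's representations.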