Semi-supervised cycle-consistency training for end-to-end ASR using unpaired speech
Wu, Ningkai
Permalink
https://hdl.handle.net/2142/108196
Description
- Title
- Semi-supervised cycle-consistency training for end-to-end ASR using unpaired speech
- Author(s)
- Wu, Ningkai
- Issue Date
- 2020-05-14
- Director of Research (if dissertation) or Advisor (if thesis)
- Hasegawa-Johnson, Mark
- Department of Study
- Electrical & Computer Engineering
- Discipline
- Electrical & Computer Engineering
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Speech recognition
- Semi-supervised training
- Abstract
- This thesis replicates the work of Takaaki Hori and colleagues (2019), which introduces a method to train end-to-end automatic speech recognition (ASR) models using unpaired speech. In general, large amounts of paired data (speech and text) are needed to train an end-to-end ASR system. To alleviate the problem of limited paired data, cycle-consistency losses have recently been proposed in areas such as machine translation and computer vision. In ASR, cycle-consistency training is achieved by building a reverse system, e.g., a text-to-speech system, and defining a loss between the reconstructed signal and the original one. However, applying cycle-consistency to ASR is not straightforward, since information is lost in the text bottleneck. Tomoki Hayashi et al. (2018) tackled this problem with a text-to-encoder (TTE) model, which predicts, from text input, the encoder states extracted by a pre-trained end-to-end ASR encoder. In this work, the TTE model was used as the reverse system, and a loss was defined by comparing the original ASR encoder states with the encoder states reconstructed by the TTE model. By using encoder states instead of raw acoustic features as targets, the model learns attention much faster and avoids having to model speaker dependencies. Our experimental results on the LibriSpeech corpus were similar to those of Hori et al. The initial ASR and TTE models were trained on the LibriSpeech 100-hour paired speech data. By applying the cycle-consistency loss and retraining the speech-to-text-to-encoder chain model on one third of the LibriSpeech 360-hour unpaired speech data, the ASR word error rate was reduced from 25.8% to 21.7% on the LibriSpeech 5-hour test data.
- Graduation Semester
- 2020-05
- Type of Resource
- Thesis
- Permalink
- http://hdl.handle.net/2142/108196
- Copyright and License Information
- Copyright 2020 Ningkai Wu
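The cycle-consistency loss described in the abstract compares the original ASR encoder states with the states the TTE model reconstructs from the hypothesis text. A minimal NumPy sketch of that comparison is below; the function name, the per-frame L1 distance, and the toy inputs are illustrative assumptions, not the thesis's exact formulation (Hori et al. define the loss as an expectation over ASR hypotheses).

```python
import numpy as np

def cycle_consistency_loss(enc_states, reconstructed, p=1):
    """Mean per-frame distance between the original ASR encoder states
    and the encoder states reconstructed by a TTE model.

    enc_states, reconstructed: (frames, dim) arrays of the same shape.
    p: order of the vector norm used per frame (1 = L1, 2 = L2).
    """
    # Distance between corresponding frames, then averaged over time.
    per_frame = np.linalg.norm(enc_states - reconstructed, ord=p, axis=1)
    return float(per_frame.mean())

# Toy check: identical states give zero loss; shifting every component of a
# 3-dimensional state by 1.0 gives an L1 loss of 3.0 per frame.
h = np.arange(12, dtype=float).reshape(4, 3)
print(cycle_consistency_loss(h, h))        # 0.0
print(cycle_consistency_loss(h, h + 1.0))  # 3.0
```

In the chain model, this scalar would be backpropagated through the ASR encoder to exploit unpaired speech, with no transcript required.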
Owning Collections
Graduate Dissertations and Theses at Illinois (PRIMARY)
Dissertations and Theses in Electrical and Computer Engineering