Unsupervised speech technology for low-resource languages
Gao, Heting
Permalink: https://hdl.handle.net/2142/124176
Description
- Title: Unsupervised speech technology for low-resource languages
- Author(s): Gao, Heting
- Issue Date: 2024-04-09
- Director of Research (if dissertation) or Advisor (if thesis): Hasegawa-Johnson, Mark
- Doctoral Committee Chair(s): Hasegawa-Johnson, Mark
- Committee Member(s): Smaragdis, Paris; Bhat, Suma P.; Tang, Yan
- Department of Study: Electrical & Computer Engineering
- Discipline: Electrical & Computer Engineering
- Degree Granting Institution: University of Illinois at Urbana-Champaign
- Degree Name: Ph.D.
- Degree Level: Dissertation
- Keyword(s): unsupervised learning; low-resource language; cross-lingual phonetic recognition; unsupervised speech synthesis; diffusion model; self-supervised learning; multimodal learning
- Abstract: Speech processing systems based on deep neural networks have found widespread use in daily life, powering tasks such as automatic speech recognition (ASR), text-to-speech (TTS) synthesis, and spoken language understanding (SLU). Given a sufficient amount of parallel speech-text training data, these systems attain performance comparable to, and in some cases better than, human capability. This assumption of sufficient data, however, holds only for resource-rich languages such as English and Mandarin Chinese; it is unrealistic for many low-resource languages, which keeps these systems from reaching similarly high performance. Improving speech processing systems under such conditions is therefore a meaningful step toward making speech technology accessible to a broader population.

Unsupervised learning is an active research field for mitigating the data sparsity of low-resource languages. Depending on the source-target scenario, unsupervised learning methods can be classified into four categories: (1) self-supervised learning (SSL), (2) modality matching, (3) unsupervised transfer learning, and (4) unsupervised multimodal learning. This thesis introduces six projects that leverage unsupervised learning to improve speech processing systems. The first project pretrains SSL models on monolingual, cross-lingual, and multimodal data to study their cross-lingual transferability. The second project improves SSL representations using synthetic speech generated by a diffusion-based unit-to-speech synthesizer. The third project falls under modality matching: we build the first unsupervised text-to-speech system using unsupervised automatic speech recognition technology. The fourth project falls under unsupervised transfer learning: we improve a zero-shot phonetic recognition system using language embeddings derived from external linguistic databases, without requiring any training data from the target languages. The fifth project also falls under transfer learning: we build a multimodal few-shot SLU system by prompting a frozen pretrained language model with text and acoustic embeddings. The sixth project likewise falls under unsupervised transfer learning: we improve a grapheme-to-phoneme (G2P) transducer by integrating it with a unit-to-phoneme (U2P) model, regularizing the G2P outputs without relying on ground-truth phoneme transcripts as training labels.

This thesis demonstrates that unsupervised learning methods can significantly improve the performance of speech recognition, speech synthesis, and speech understanding in low-resource application scenarios.
- Graduation Semester: 2024-05
- Type of Resource: Thesis
- Copyright and License Information: Copyright 2024 Heting Gao
Owning Collections
Graduate Dissertations and Theses at Illinois (Primary)