Exploiting monolingual data for neural machine translation
Wang, Yiren
Permalink
https://hdl.handle.net/2142/122004
Description
- Title
- Exploiting monolingual data for neural machine translation
- Author(s)
- Wang, Yiren
- Issue Date
- 2023-11-21
- Director of Research (if dissertation) or Advisor (if thesis)
- Zhai, ChengXiang
- Doctoral Committee Chair(s)
- Zhai, ChengXiang
- Committee Member(s)
- Hockenmaier, Julia
- Ji, Heng
- Awadalla, Hany Hassan
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Neural Machine Translation
- Dual Learning
- Multi-task Learning
- Abstract
- Neural Machine Translation (NMT) has made rapid progress in recent years and achieved remarkable translation quality. NMT systems rely heavily on large-scale, high-quality parallel training data of source and target languages (bitext), which is limited and costly to collect. Meanwhile, large amounts of unlabeled monolingual data are available in different languages. While target-side monolingual data has been proven useful through Back Translation, source-side data remains less investigated. Effectively exploiting both source-side and target-side monolingual data could not only boost NMT performance but also improve data efficiency. In this thesis, we focus on exploring novel approaches that effectively utilize monolingual data to improve the translation quality of NMT models. First, we exploit the potential of monolingual data with dual learning. Dual learning effectively utilizes both bitext and monolingual data by leveraging the primal-dual structure of artificial intelligence tasks to generate informative feedback signals that regularize training. We extend the standard dual learning framework by introducing multiple primal and dual models, and propose Multi-Agent Dual Learning (MADL) to further boost data utilization and translation quality. We show the effectiveness of our proposed approach on multiple benchmark NMT tasks, including both supervised and unsupervised NMT. Second, we study the effect of forward and back translation data at scale, and propose a large-scale noisy training pipeline to effectively leverage both. First, we generate synthetic bitext with both forward and back translation. Next, we train the NMT model on a noisy version of this synthetic corpus, where each source sentence is randomly corrupted. Finally, the model is fine-tuned on the genuine bitext and a clean, high-quality subset of the synthetic bitext without any noise. With our proposed strategy, we achieve state-of-the-art performance on various benchmark translation tasks. Third, we explore more general strategies to utilize monolingual data not only for high-resource language pairs in bilingual NMT, but also for low-resource language pairs and multilingual NMT. We propose a multi-task learning (MTL) framework, which jointly trains the model with the translation task on bitext data and two auxiliary self-supervised learning tasks on the source-side and target-side monolingual data, respectively. We show that our approach can effectively improve translation quality in the multilingual setting for both high-resource and low-resource languages. We also demonstrate the effectiveness of our method over pre-training approaches for both NMT and cross-lingual transfer learning on natural language understanding tasks.
- Graduation Semester
- 2023-12
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2023 Yiren Wang
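The abstract's noisy training pipeline corrupts each source sentence of the synthetic bitext before training. As a minimal illustration of what such a corruption step might look like, here is a sketch of a noising function combining three common noise operations (word dropping, word blanking, and local shuffling); the exact noise types, parameters, and the `<BLANK>` placeholder token used here are illustrative assumptions, not necessarily those used in the thesis.

```python
import random


def corrupt_source(tokens, drop_prob=0.1, blank_prob=0.1, shuffle_k=3, rng=None):
    """Randomly corrupt a tokenized source sentence.

    Illustrative noising only: drops tokens, replaces tokens with a
    placeholder, and lightly shuffles word order. The actual noise scheme
    in the thesis may differ.
    """
    rng = rng or random.Random()
    out = []
    for tok in tokens:
        r = rng.random()
        if r < drop_prob:
            continue  # drop this token entirely
        if r < drop_prob + blank_prob:
            out.append("<BLANK>")  # blank out this token
        else:
            out.append(tok)
    # Local shuffle: each token can move at most ~shuffle_k positions,
    # implemented by sorting on jittered position keys.
    keys = [i + rng.uniform(0, shuffle_k) for i in range(len(out))]
    return [tok for _, tok in sorted(zip(keys, out), key=lambda p: p[0])]


source = "the quick brown fox jumps over the lazy dog".split()
noisy = corrupt_source(source, rng=random.Random(0))
print(noisy)
```

The clean synthetic target sentence is kept unchanged, so the model learns to translate from corrupted inputs before the final fine-tuning pass on genuine, noise-free bitext.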
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)