Exploiting monolingual data for neural machine translation
Wang, Yiren
Permalink
https://hdl.handle.net/2142/122004
Description
- Title
- Exploiting monolingual data for neural machine translation
- Author(s)
- Wang, Yiren
- Issue Date
- 2023-11-21
- Director of Research (if dissertation) or Advisor (if thesis)
- Zhai, ChengXiang
- Doctoral Committee Chair(s)
- Zhai, ChengXiang
- Committee Member(s)
- Hockenmaier, Julia
- Ji, Heng
- Awadalla, Hany Hassan
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Neural Machine Translation
- Dual Learning
- Multi-task Learning
- Abstract
- Neural Machine Translation (NMT) has made rapid progress in recent years and achieved remarkable translation quality. NMT systems rely heavily on large-scale, high-quality parallel training data of source and target languages (bitext), which is limited and costly to collect. Meanwhile, large amounts of unlabeled monolingual data are available in different languages. While target-side monolingual data has been proven useful through Back Translation, source-side data remains less investigated. Effectively exploiting both source-side and target-side monolingual data could not only boost NMT performance but also improve data efficiency. In this thesis, we focus on exploring novel approaches that effectively utilize monolingual data to improve the translation quality of NMT models. First, we exploit the potential of monolingual data with dual learning. Dual learning effectively utilizes both bitext and monolingual data by leveraging the primal-dual structure of artificial intelligence tasks to generate informative feedback signals that regularize training. We extend the standard dual learning framework by introducing multiple primal and dual models, and propose Multi-Agent Dual Learning (MADL) to further boost data utilization and translation quality. We show the effectiveness of our proposed approach on multiple benchmark NMT tasks, including both supervised and unsupervised NMT. Second, we study the effect of forward and back translation data at scale, and propose a large-scale noisy training pipeline to effectively leverage both. First, we generate synthetic bitext with both forward and back translation. Next, we train the NMT model on a noisy version of this synthetic corpus, where each source sentence is randomly corrupted. Finally, the model is fine-tuned on the genuine bitext and a clean, high-quality subset of the synthetic bitext without any noise. With our proposed strategy, we achieve state-of-the-art performance on various benchmark translation tasks. Third, we explore more general strategies to utilize monolingual data not only for high-resource language pairs in bilingual NMT, but also for low-resource language pairs and multilingual NMT. We propose a multi-task learning (MTL) framework, which jointly trains the model with the translation task on bitext data and two auxiliary self-supervised learning tasks on the source-side and target-side monolingual data, respectively. We show that our approach can effectively improve translation quality in the multilingual setting for both high-resource and low-resource languages. We also demonstrate the effectiveness of our method over pre-training approaches for both NMT and cross-lingual transfer learning on natural language understanding tasks.
- Graduation Semester
- 2023-12
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2023 Yiren Wang
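The abstract's noisy training pipeline corrupts each source sentence of the synthetic bitext before training. As a minimal illustration of what such a corruption step might look like, here is a sketch of a noising function combining three common noise operations (word dropping, word blanking, and local shuffling); the exact noise types, parameters, and the `<BLANK>` placeholder token used here are illustrative assumptions, not necessarily those used in the thesis.

```python
import random


def corrupt_source(tokens, drop_prob=0.1, blank_prob=0.1, shuffle_k=3, rng=None):
    """Randomly corrupt a tokenized source sentence.

    Illustrative noising only: drops tokens, replaces tokens with a
    placeholder, and lightly shuffles word order. The actual noise scheme
    in the thesis may differ.
    """
    rng = rng or random.Random()
    out = []
    for tok in tokens:
        r = rng.random()
        if r < drop_prob:
            continue  # drop this token entirely
        if r < drop_prob + blank_prob:
            out.append("<BLANK>")  # blank out this token
        else:
            out.append(tok)
    # Local shuffle: each token can move at most ~shuffle_k positions,
    # implemented by sorting on jittered position keys.
    keys = [i + rng.uniform(0, shuffle_k) for i in range(len(out))]
    return [tok for _, tok in sorted(zip(keys, out), key=lambda p: p[0])]


source = "the quick brown fox jumps over the lazy dog".split()
noisy = corrupt_source(source, rng=random.Random(0))
print(noisy)
```

The clean synthetic target sentence is kept unchanged, so the model learns to translate from corrupted inputs before the final fine-tuning pass on genuine, noise-free bitext.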
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)