Automated grammatical error detection for Chinese learners

Wang, Yiyi

Content Files

WANG-DISSERTATION-2022.pdf

Permalink

https://hdl.handle.net/2142/115535

Description

Title

Automated grammatical error detection for Chinese learners

Author(s)

Wang, Yiyi

Issue Date

2022-04-14

Director of Research (if dissertation) or Advisor (if thesis)

Shih, Chilin

Doctoral Committee Chair(s)

Shih, Chilin

Committee Member(s)

Sadler, Misumi
Girju, Corina Roxana
Yan, Xun

Department of Study

E. Asian Languages & Cultures

Discipline

E. Asian Languages & Cultures

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Keyword(s)

Automated grammatical error detection
natural language processing
corpus-based error analysis

Abstract

Grammatical Error Detection (GED) is an important application of Natural Language Processing (NLP) in Computer-Assisted Language Learning (CALL). A GED model takes a text written by language learners as input and identifies the positions of grammatical errors and the corresponding error types as output, which can be beneficial for language learners who want to obtain immediate diagnostic feedback on their writing. The availability of large-scale, well-annotated learner corpora enables researchers to explore a representative sample of learner errors and to analyze contextual and linguistic features that can guide the construction of automated GED models. Chinese is one of the most difficult languages to learn owing to its unique linguistic characteristics; however, developing an automated GED system for Chinese learners is an underexplored field of research that has the potential to assist an underserved group of learners. This thesis investigates the grammatical errors made by Chinese learners using a large-scale, extensively annotated learner corpus. Error analysis can provide information on the nature of the GED task and assist in the practical development of error-detection tools. This thesis conducts a corpus-based analysis that explores the schema used to annotate the Chinese learner corpus and analyzes the distribution of errors by the learners' language backgrounds and proficiency levels. The findings of the corpus analysis are directly used to create a representative GED test set for evaluating model performance and directing the generation of synthetic data for training GED models. Due to the imbalanced nature of error types, this thesis evaluates GED model performance in terms of error types and makes recommendations for best practices for evaluating Chinese GED models.
Due to the difficulty of gathering annotated learner data, GED might be considered a low-resource task. Data augmentation is a frequently utilized strategy for resolving difficulties associated with the sparsity of training data in low-resource scenarios. Three strategies for data augmentation are discussed: tagged neural machine translation (NMT), untagged NMT, and rule-based methods. The results show that all three techniques contribute to improving the performance of the Chinese GED baseline model.

Graduation Semester

2022-05

Type of Resource

Thesis

Handle URL

https://hdl.handle.net/2142/115535

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois
