Automated grammatical error detection for Chinese learners
Wang, Yiyi
Permalink
https://hdl.handle.net/2142/115535
Description
Title
Automated grammatical error detection for Chinese learners
Author(s)
Wang, Yiyi
Issue Date
2022-04-14
Director of Research (if dissertation) or Advisor (if thesis)
Shih, Chilin
Doctoral Committee Chair(s)
Shih, Chilin
Committee Member(s)
Sadler, Misumi
Girju, Corina Roxana
Yan, Xun
Department of Study
E. Asian Languages & Cultures
Discipline
E. Asian Languages & Cultures
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
Ph.D.
Degree Level
Dissertation
Keyword(s)
Automated grammatical error detection
natural language processing
corpus-based error analysis
Abstract
Grammatical Error Detection (GED) is an important application of Natural Language Processing (NLP) in Computer-Assisted Language Learning (CALL). A GED model takes a text written by language learners as input and outputs the positions of grammatical errors together with their error types, which benefits learners who want immediate diagnostic feedback on their writing. The availability of large-scale, well-annotated learner corpora enables researchers to explore a representative sample of learner errors and to analyze contextual and linguistic features that can guide the construction of automated GED models. Chinese is one of the most difficult languages to learn owing to its unique linguistic characteristics; however, automated GED for learners of Chinese remains an underexplored field of research that has the potential to assist an underserved group of learners.
This thesis investigates the grammatical errors made by learners of Chinese using a large-scale, extensively annotated learner corpus. Error analysis can provide information on the nature of the GED task and assist in the practical development of error-detection tools. This thesis conducts a corpus-based analysis that explores the schema used to annotate the Chinese learner corpus and analyzes the distribution of errors by the learners' language backgrounds and proficiency levels. The findings of the corpus analysis are used directly to create a representative GED test set for evaluating model performance and to direct the generation of synthetic data for training GED models. Because error types are imbalanced, this thesis evaluates GED model performance separately for each error type and recommends best practices for evaluating Chinese GED models.
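As a rough illustration of evaluating by error type, the sketch below computes precision, recall, and F1 separately for each type. It is not taken from the thesis: the function name `per_type_scores`, the tuple layout `(sentence_id, start, end, error_type)`, and the exact-match criterion are all assumptions made for this example.

```python
def per_type_scores(gold, pred):
    """Score predictions separately for each error type.

    gold, pred: sets of (sentence_id, start, end, error_type) tuples.
    This layout is a hypothetical one chosen for the sketch. A prediction
    counts as correct only if all four fields match a gold annotation.
    """
    types = {g[3] for g in gold} | {p[3] for p in pred}
    scores = {}
    for t in sorted(types):
        gold_t = {g for g in gold if g[3] == t}
        pred_t = {p for p in pred if p[3] == t}
        true_pos = len(gold_t & pred_t)
        precision = true_pos / len(pred_t) if pred_t else 0.0
        recall = true_pos / len(gold_t) if gold_t else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[t] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": len(gold_t),  # number of gold errors of this type
        }
    return scores


# Toy annotations with made-up type labels "S" and "M".
gold = {(1, 0, 1, "S"), (1, 3, 4, "M"), (2, 2, 3, "S")}
pred = {(1, 0, 1, "S"), (2, 5, 6, "S")}
result = per_type_scores(gold, pred)
```

Reporting the support next to each score makes it visible when a rare error type is driving, or being hidden by, an aggregate number.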
Because annotated learner data is difficult to gather, GED can be considered a low-resource task. Data augmentation is a common strategy for mitigating the sparsity of training data in low-resource scenarios. This thesis examines three data augmentation strategies: tagged neural machine translation (NMT), untagged NMT, and rule-based methods. The results show that all three techniques improve the performance of the Chinese GED baseline model.
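To make the rule-based strategy more concrete, the sketch below injects one synthetic error into a correct, pre-segmented sentence and records where it was placed. The abstract does not describe the thesis's actual rules, so everything here is an assumption for illustration: the function name `inject_error`, the three operations, and the labels "M" (missing word), "W" (word order), and "R" (redundant word).

```python
import random


def inject_error(tokens, op, rng):
    """Return (noisy_tokens, position) after applying one synthetic error.

    tokens: a correct sentence, already segmented into words.
    op: hypothetical label, one of "M" (delete a word), "W" (swap two
        adjacent words), or "R" (duplicate a word).
    rng: a random.Random instance, so results are reproducible.
    """
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to inject an error")
    noisy = list(tokens)  # copy so the caller's list is not modified
    if op == "M":
        pos = rng.randrange(len(noisy))
        del noisy[pos]
    elif op == "W":
        pos = rng.randrange(len(noisy) - 1)
        noisy[pos], noisy[pos + 1] = noisy[pos + 1], noisy[pos]
    elif op == "R":
        pos = rng.randrange(len(noisy))
        noisy.insert(pos, noisy[pos])
    else:
        raise ValueError(f"unknown operation: {op}")
    return noisy, pos


tokens = ["我", "昨天", "去", "了", "学校"]
rng = random.Random(0)
missing, m_pos = inject_error(tokens, "M", rng)
swapped, w_pos = inject_error(tokens, "W", rng)
repeated, r_pos = inject_error(tokens, "R", rng)
```

Each noisy sentence, paired with its label and position, could serve as one synthetic training example; a realistic rule set would presumably be guided by the error distribution observed in the learner corpus rather than by uniform random choices.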