Automated grammatical error detection for Chinese learners
Wang, Yiyi
Permalink
https://hdl.handle.net/2142/115535
Description
- Title
- Automated grammatical error detection for Chinese learners
- Author(s)
- Wang, Yiyi
- Issue Date
- 2022-04-14
- Director of Research (if dissertation) or Advisor (if thesis)
- Shih, Chilin
- Doctoral Committee Chair(s)
- Shih, Chilin
- Committee Member(s)
- Sadler, Misumi
- Girju, Corina Roxana
- Yan, Xun
- Department of Study
- E. Asian Languages & Cultures
- Discipline
- E. Asian Languages & Cultures
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Automated grammatical error detection
- natural language processing
- corpus-based error analysis
- Abstract
- Grammatical Error Detection (GED) is an important application of Natural Language Processing (NLP) in Computer-Assisted Language Learning (CALL). A GED model takes a text written by language learners as input and identifies the positions of grammatical errors and the corresponding error types as output, which can be beneficial for language learners who want to obtain immediate diagnostic feedback on their writing. The availability of large-scale, well-annotated learner corpora enables researchers to explore a representative sample of learner errors and to analyze contextual and linguistic features that can guide the construction of automated GED models. Chinese is one of the most difficult languages to learn owing to its unique linguistic characteristics; however, developing an automated GED system for Chinese learners is an underexplored field of research that has the potential to assist an underserved group of learners. This thesis investigates the grammatical errors made by Chinese learners using a large-scale, extensively annotated learner corpus. Error analysis can provide information on the nature of the GED task and assist in the practical development of error-detection tools. This thesis conducts a corpus-based analysis that explores the schema used to annotate the Chinese learner corpus and analyzes the distribution of errors by the learners' language backgrounds and proficiency levels. The findings of the corpus analysis are directly used to create a representative GED test set for evaluating model performance and directing the generation of synthetic data for training GED models. Due to the imbalanced nature of error types, this thesis evaluates GED model performance in terms of error types and makes recommendations for best practices for evaluating Chinese GED models. Due to the difficulty of gathering annotated learner data, GED might be considered a low-resource task. 
Data augmentation is a frequently utilized strategy for resolving difficulties associated with the sparsity of training data in low-resource scenarios. Three strategies for data augmentation are discussed: tagged neural machine translation (NMT), untagged NMT, and rule-based methods. The results show that all three techniques contribute to improving the performance of the Chinese GED baseline model.
- Graduation Semester
- 2022-05
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2022 Yiyi Wang
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois