Pitfalls and possibilities: What NLP systems are missing out on
Park, Hyunji (Hayley)
Permalink
https://hdl.handle.net/2142/116178
Description
- Title
- Pitfalls and possibilities: What NLP systems are missing out on
- Author(s)
- Park, Hyunji (Hayley)
- Issue Date
- 2022-07-05
- Director of Research (if dissertation) or Advisor (if thesis)
- Schwartz, Lane
- Doctoral Committee Chair(s)
- Schwartz, Lane
- Committee Member(s)
- Hockenmaier, Julia
- Ji, Heng
- Tyers, Francis M.
- Department of Study
- Linguistics
- Discipline
- Linguistics
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- natural language processing
- computational linguistics
- morphology
- deep learning
- Abstract
- Despite recent advances in natural language processing (NLP), many areas of the field still need substantial progress. This dissertation presents three studies in which careful consideration of datasets challenges existing NLP methods. First, we present a study that augments existing data to investigate the effect of morphology on LSTM language modeling. With NLP research disproportionately dedicated to English and a few other morphologically poor languages, the effect of morphology is under-studied, and a couple of previous papers disagree on the interaction between morphology and language modeling difficulty. By compiling a parallel Bible corpus and a linguistic typology database that represent morphological typology, we show that morphological complexity makes a language harder to model and affects the effectiveness of subword segmentation methods such as BPE. Next, we develop the first dependency treebank for St. Lawrence Island Yupik to show that morphology interacts with syntax in this polysynthetic language in the context of dependency parsing. We argue that the Universal Dependencies (UD) guidelines, which focus on word-level annotations, should be extended to morpheme-level annotations to better serve morphologically rich languages. Finally, we present a study on long document classification in English using Transformers, focusing on the validity of the evaluation methods available for this newly developed task. By providing a comprehensive evaluation of existing models’ relative efficacy against various datasets and baselines, we show that existing models often fail to outperform simple baseline models and yield inconsistent performance across datasets. The findings emphasize that, to develop robust models, future studies should consider comprehensive baselines and datasets that better represent the task of long document classification.
In all, this dissertation sheds light on areas in NLP that need further investigation and emphasizes the importance of careful consideration of the datasets involved.
- Graduation Semester
- 2022-08
- Type of Resource
- Thesis
- Copyright and License Information
- 2022 Hyunji (Hayley) Park
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois