Non-native text analysis with Syntactic Diff, a general comparative text mining framework
Massung, Sean Alexander
Permalink
https://hdl.handle.net/2142/78606
Description
- Title
- Non-native text analysis with Syntactic Diff, a general comparative text mining framework
- Author(s)
- Massung, Sean Alexander
- Issue Date
- 2015-04-15
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- text mining
- natural language processing
- comparative text mining
- non-native text analysis
- non-native text mining
- second language education
- non-native English speakers
- Abstract
- Non-native speakers of English far outnumber native speakers; English is the main language of books, newspapers, airports, air-traffic control, international business, academic conferences, science, technology, diplomacy, sports, international competitions, pop music, and advertising [1]. Online education in the form of MOOCs (massive open online courses) is also primarily in English, even for courses that teach English. This creates enormous amounts of text written by non-native speakers, which in turn generates a need for grammar correction and analysis. Even aside from MOOCs, the number of English learners in Asia alone is in the tens of millions. In response to this motivation, we describe SYNTACTIC DIFF, a novel edit-based method for transforming sequences of words given a reference corpus. These transformations can be used directly or can be employed as features to represent text data in a wide variety of text mining scenarios. As case studies, we apply SYNTACTIC DIFF to four different tasks in non-native text analysis and show its benefit in each case. In the first task, we use weighted word edits with likelihood scoring for grammatical error correction. Our method is compared against systems in a grammar correction shared task, and we find that SYNTACTIC DIFF edits perform comparably while being much more general than the other methods. The second task is native language identification: a classification problem that predicts the native language of a student writer from English essays. We represent documents as vectors of edits, and show that a combination of unigram words and SYNTACTIC DIFF edits outperforms each representation individually. The third task is fluency scoring, in which we test whether the manually categorized fluency levels of English students can be modeled by SYNTACTIC DIFF features. In the fourth task, we create clusters of student essays with similar errors via topic modeling, and find that the interpretability is significantly higher than that of a word n-gram approach. SYNTACTIC DIFF is highly customizable and able to capture syntactic differences from a reference corpus at the sentence, document, and subcorpus levels. This enables both a rich translation method and a feature representation for many text mining tasks that deal with word usage and syntax beyond bag-of-words. In particular, this thesis focuses on non-native text analysis applications, though SYNTACTIC DIFF is not limited to that domain.
- Graduation Semester
- 2015-5
- Type of Resource
- text
- Permalink
- http://hdl.handle.net/2142/78606
- Copyright and License Information
- Copyright 2015 Sean Massung
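Illustrative sketch
The abstract mentions representing documents as vectors of edits combined with unigram words, but it does not say how the edits are computed. The Python sketch below is only a rough illustration of that general idea, not the thesis's method: it assumes a single reference sentence instead of a reference corpus, and it swaps in the standard-library `difflib.SequenceMatcher` alignment as a stand-in for the actual edit procedure. The function names `word_edits` and `features` and the `w:`/`e:` feature prefixes are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def word_edits(source, reference):
    """Return word-level edits that turn `source` into `reference`.

    Hypothetical helper: difflib alignment is an assumption made for
    illustration; the abstract does not describe the real edit procedure.
    """
    src, ref = source.lower().split(), reference.lower().split()
    edits = []
    matcher = SequenceMatcher(a=src, b=ref, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # words shared by both sentences produce no edit
        removed = " ".join(src[i1:i2])
        added = " ".join(ref[j1:j2])
        if tag == "replace":
            edits.append(f"sub({removed}->{added})")
        elif tag == "delete":
            edits.append(f"del({removed})")
        else:  # tag == "insert"
            edits.append(f"ins({added})")
    return edits


def features(source, reference):
    """Combine unigram counts and edit counts in one sparse vector."""
    vec = Counter(f"w:{w}" for w in source.lower().split())
    vec.update(f"e:{e}" for e in word_edits(source, reference))
    return vec
```

Calling `features("she go to school yesterday", "she went to school yesterday")` yields a `Counter` that holds the five unigram features alongside a single substitution edit. Keeping both feature types in one sparse vector mirrors the abstract's observation that the combination outperformed either representation alone, though the actual feature design may differ.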
Owning Collections
Graduate Dissertations and Theses at Illinois (PRIMARY)
Graduate Theses and Dissertations at Illinois
Dissertations and Theses - Computer Science
Dissertations and Theses from the Dept. of Computer Science