Non-native text analysis with Syntactic Diff, a general comparative text mining framework

Massung, Sean Alexander

Permalink

https://hdl.handle.net/2142/78606

Description

Title

Non-native text analysis with Syntactic Diff, a general comparative text mining framework

Author(s)

Massung, Sean Alexander

Issue Date

2015-04-15

Department of Study

Computer Science

Discipline

Computer Science

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

M.S.

Degree Level

Thesis

Keyword(s)

text mining
natural language processing
comparative text mining
non-native text analysis
non-native text mining
second language education
non-native English speakers

Abstract

Non-native speakers of English far outnumber native speakers; English is the main language of books, newspapers, airports, air-traffic control, international business, academic conferences, science, technology, diplomacy, sports, international competitions, pop music, and advertising [1]. Online education in the form of MOOCs (massive open online courses) is also primarily in English, including courses that teach English itself. This creates enormous amounts of text written by non-native speakers, which in turn generates a need for grammar correction and analysis. Even aside from MOOCs, the number of English learners in Asia alone is in the tens of millions. In response to this powerful motivation, we describe SYNTACTIC DIFF, a novel edit-based method for transforming sequences of words given a reference corpus. These transformations can be used directly or can be employed as features to represent text data in a wide variety of text mining scenarios. As case studies, we apply SYNTACTIC DIFF to four quite different tasks in non-native text analysis and show its benefit in each case. In the first task, we use weighted word edits with likelihood scoring for grammatical error correction. Our method is compared against systems in a grammar correction shared task, and we find that SYNTACTIC DIFF edits perform comparably while being much more general than the other methods. The second task is native language identification: a classification problem predicting the native language of a student writer based on English essays. We represent documents as vectors of edits, and show that a combination of unigram words and SYNTACTIC DIFF edits outperforms each representation individually. The third task is fluency scoring, in which we test whether the manually categorized fluency levels of English students can be modeled by SYNTACTIC DIFF features. In the fourth task, we create clusters of student essays with similar errors via topic modeling, and find that the interpretability is significantly higher than that of a word n-gram approach. SYNTACTIC DIFF is highly customizable and able to capture syntactic differences from a reference corpus at the sentence, document, and subcorpus levels. This enables both a rich translation method and feature representation for many text mining tasks that deal with word usage and syntax beyond bag-of-words. In particular, this thesis focuses on non-native text analysis applications, though SYNTACTIC DIFF is not at all limited to that domain.
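
The abstract's idea of representing text as word-level edits can be illustrated with a small sketch. This is not the SYNTACTIC DIFF algorithm from the thesis. It is a generic illustration that assumes each learner sentence already comes paired with a corrected version, whereas the thesis derives edits from a reference corpus. The function names `word_edits` and `edit_features` are hypothetical, and the alignment uses Python's standard `difflib.SequenceMatcher`.

```python
from collections import Counter
from difflib import SequenceMatcher


def word_edits(source, reference):
    """Return the word-level edits that turn `source` into `reference`.

    Assumption: a corrected `reference` sentence is given. The thesis
    instead works against a reference corpus.
    """
    src, ref = source.split(), reference.split()
    matcher = SequenceMatcher(a=src, b=ref, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged words carry no edit information
        edits.append((tag, " ".join(src[i1:i2]), " ".join(ref[j1:j2])))
    return edits


def edit_features(pairs):
    """Count edits over (source, reference) pairs as a sparse feature vector."""
    counts = Counter()
    for source, reference in pairs:
        for tag, old, new in word_edits(source, reference):
            counts[f"{tag}:{old}->{new}"] += 1
    return counts


if __name__ == "__main__":
    print(word_edits("she go to school", "she goes to school"))
    print(edit_features([
        ("she go to school", "she goes to school"),
        ("he go home", "he goes home"),
    ]))
```

Counts of this kind could be fed to a classifier alongside unigram word counts, in the spirit of the combined representation the abstract describes for native language identification.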

Graduation Semester

2015-5

Type of Resource

text

Owning Collections

Graduate Dissertations and Theses at Illinois (primary collection)

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Computer Science

Dissertations and Theses from the Dept. of Computer Science
