Beyond topic-based representations for text mining
Massung, Sean Alexander
Description
- Title
- Beyond topic-based representations for text mining
- Author(s)
- Massung, Sean Alexander
- Issue Date
- 2017-04-19
- Director of Research (if dissertation) or Advisor (if thesis)
- Zhai, ChengXiang
- Doctoral Committee Chair(s)
- Zhai, ChengXiang
- Committee Member(s)
- Hockenmaier, Julia
- Roth, Dan
- LaValle, Steve
- Mei, Qiaozhu
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Text mining
- Text representation
- Feature representation
- Natural language processing
- Abstract
- "A massive amount of online information is natural language text: newspapers, blog articles, forum posts and comments, tweets, scientific literature, government documents, and more. While in general, all kinds of online information is useful, textual information is especially important—it is the most natural, most common, and most expressive form of information. Text representation plays a critical role in application tasks like classification or information retrieval since the quality of the underlying feature space directly impacts each task's performance. Because of this importance, many different approaches have been developed for generating text representations. By far, the most common way to generate features is to segment text into words and record their n-grams. While simple term features perform relatively well in topic-based tasks, not all downstream applications are of a topical nature and can be captured by words alone. For example, determining the native language of an English essay writer will depend on more than just word choice. Competing methods to topic-based representations (such as neural networks) are often not interpretable or rely on massive amounts of training data. This thesis proposes three novel contributions to generate and analyze a large space of non-topical features. First, structural parse tree features are solely based on structural properties of a parse tree by ignoring all of the syntactic categories in the tree. An important advantage of these ""skeletons"" over regular syntactic features is that they can capture global tree structures without causing problems of data sparseness or overfitting. Second, SyntacticDiff explicitly captures differences in a text document with respect to a reference corpus, creating features that are easily explained as weighted word edit differences. These edit features are especially useful since they are derived from information not present in the current document, capturing a type of comparative feature. Third, Cross-Context Lexical Analysis is a general framework for analyzing similarities and differences in both term meaning and representation with respect to different, potentially overlapping partitions of a text collection. The representations analyzed by CCLA are not limited to topic-based features."
- Graduation Semester
- 2017-05
- Type of Resource
- text
- Permalink
- http://hdl.handle.net/2142/97736
- Copyright and License Information
- Copyright 2017 Sean Massung
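The first contribution described in the abstract, structural parse-tree "skeletons", lends itself to a short illustration. Below is a minimal Python sketch, not taken from the thesis: the nested-tuple tree encoding, the function names, and the toy parse are all hypothetical. It discards every syntactic category and keeps only the bracketing structure of each subtree, counting those shapes as features.

```python
# Minimal sketch (hypothetical code, not the thesis's implementation):
# derive a label-free "skeleton" from a constituency parse and use the
# skeletons of its subtrees as a bag of structural features.
from collections import Counter

def skeleton(tree):
    """Return the tree's shape with every syntactic category removed."""
    _label, *children = tree
    if not children:                      # terminal node
        return "()"
    return "(" + "".join(skeleton(c) for c in children) + ")"

def skeleton_features(tree):
    """Count one skeleton per subtree; the Counter can feed a classifier."""
    feats = Counter()
    def visit(node):
        feats[skeleton(node)] += 1
        for child in node[1:]:
            visit(child)
    visit(tree)
    return feats

if __name__ == "__main__":
    # Toy parse of "the cat sat": (S (NP (DT the) (NN cat)) (VP (VBD sat)))
    parse = ("S",
             ("NP", ("DT", ("the",)), ("NN", ("cat",))),
             ("VP", ("VBD", ("sat",))))
    print(skeleton(parse))            # -> (((())(()))((())))
    print(skeleton_features(parse))
```

Because the category labels are dropped, sentences with different vocabulary but similar syntax map to the same skeleton features, which is consistent with the abstract's point about capturing global tree structure without the sparseness of fully labeled syntactic features.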
Owning Collections
- Graduate Dissertations and Theses at Illinois (PRIMARY)
- Dissertations and Theses - Computer Science