Text mining with word embedding for outlier and sentiment analysis

Zhuang, Honglei

Text mining with word embedding for outlier and sentiment analysis

Zhuang, Honglei

Permalink

https://hdl.handle.net/2142/105058

Description

Title

Text mining with word embedding for outlier and sentiment analysis

Author(s)

Zhuang, Honglei

Issue Date

2019-04-19

Director of Research (if dissertation) or Advisor (if thesis)

Han, Jiawei

Doctoral Committee Chair(s)

Han, Jiawei

Committee Member(s)

Zhai, ChengXiang
Peng, Jian
Young, Joel D

Department of Study

Computer Science

Discipline

Computer Science

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Keyword(s)

text mining
word embedding
outlier analysis
sentiment analysis

Abstract

The technology today makes it unprecedentedly easy to collect and store massive text data in various domains such as online social networks, medical records and news reports. In contrast to the gigantic volume of text data, human capabilities to read and process text data is limited. Hence, there is an emerging demand for automatic text mining tools to analyze massive text data. Word embedding is an emerging text analysis technique that leverages the fine-grained statistics of context information to map each word to a vector in the embedding space which reflects the semantic proximity between words. Embedding techniques not only enrich the statistical signals to utilize in downstream text mining applications, but also provide the possibility to characterize and represent higher-level objects in the embedding space, such as sentences, documents or topics. This study integrates word embedding techniques into a series of text mining approaches and models. The general idea is to take a text object such as a document or a sentence as a bag of embedding vectors and characterize their distributions in the embedding space. Specifically, this study focuses on two tasks: outlier analysis and weakly-supervised sentiment analysis. Outlier analysis aims to identify documents that topically deviate from the majority of a given corpus. We develop an unsupervised generative model to identify frequent and representative semantic regions in the word embedding space to represent the given corpus. Then we propose a novel outlierness measure to identify outlier documents. We also study the cost-sensitive scenario of outlier analysis. Sentiment analysis typically identifies the subjective opinion (e.g., positive vs. negative) in a piece of text. Despite being extensively studied as a supervised learning task, we tackle the problem in a weakly-supervised fashion, where users only provide a small set of seed words as guidance. We study to identify aspects and corresponding sentiments at both document and sentence level.

Graduation Semester

2019-05

Type of Resource

text

Permalink

http://hdl.handle.net/2142/105058

Copyright and License Information

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Computer Science

Dissertations and Theses from the Dept. of Computer Science

Text mining with word embedding for outlier and sentiment analysis

Zhuang, Honglei

Permalink

Description

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Dissertations and Theses - Computer Science

Log In