Text mining at multiple granularity: leveraging subwords, words, phrases, and sentences

El-Kishky, Ahmed Hassan

Permalink

https://hdl.handle.net/2142/108161

Description

Title

Text mining at multiple granularity: leveraging subwords, words, phrases, and sentences

Author(s)

El-Kishky, Ahmed Hassan

Issue Date

2020-05-06

Director of Research (if dissertation) or Advisor (if thesis)

Han, Jiawei

Doctoral Committee Chair(s)

Han, Jiawei
Zhai, ChengXiang
Abdelzaher, Tarek
Zhang, Joy

Department of Study

Computer Science

Discipline

Computer Science

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Keyword(s)

subwords
phrases
sentences
embedding
data mining
NLP
cross-lingual

Abstract

With the rapid digitization of information, large quantities of text-heavy data are constantly being generated in many languages and across domains such as web documents, research papers, business reviews, news, and social posts. As such, efficiently and effectively searching, organizing, and extracting meaningful information and data from these massive unstructured corpora is essential to laying the foundation for many downstream text mining and natural language processing (NLP) tasks. Traditionally, NLP and text mining techniques are applied to raw text while treating individual words as the base semantic unit. However, the assumption that individual word tokens are the correct semantic granularity does not hold for many tasks and can lead to many problems and poor task performance. To address this, this work introduces techniques for identifying and utilizing text at different semantic granularities to solve a variety of text mining and NLP tasks. The general idea is to take a text object, such as a document, and decompose it into multiple levels of semantic granularity, such as sentences, phrases, words, or subword structures. Once the text is represented at different levels of semantic granularity, we demonstrate techniques that leverage the properly encoded text to solve a variety of NLP tasks. Specifically, this study focuses on three levels of semantic granularity: (1) subword segmentation, with an application to enriching word embeddings to address word sparsity; (2) phrase mining, with an application to phrase-based topic modeling; and (3) sentence-level granularity, with an application to finding parallel cross-lingual data.

The first granularity we study is the subword level. We introduce a subword mining problem that aims to segment individual word tokens into smaller subword structures. The motivation is that individual words are often too coarse a granularity and need to be supplemented by a finer semantic granularity. Operating on these fine-grained subwords addresses several important problems in NLP, namely the long-tail data-sparsity problem, whereby most words in a corpus are infrequent, and the more severe out-of-vocabulary problem. To mine these subword structures effectively and efficiently, we propose an unsupervised segmentation algorithm based on a novel objective: transition entropy. We use ground-truth segmentations to assess the quality of the segmented words and further demonstrate the benefit of jointly leveraging words and subwords for distributed word representations.

The second granularity we study is the phrase level, together with the phrase mining task of transforming raw unstructured text from a fine-grained sequence of words into a coarser-granularity sequence of single-word and multi-word phrases. The motivation is that human language often contains idiomatic multi-word expressions, and fine-grained words fail to capture the right semantic granularity; proper phrasal segmentation can capture the appropriate semantic granularity. To address this problem, we propose an unsupervised phrase mining algorithm based on frequent, significant contiguous text patterns. We use human evaluation to assess the quality of the mined phrases and demonstrate the benefit of pre-mining phrases on a downstream topic-modeling task.

The third granularity we study is the sentence level. We motivate the need for sentence-level granularity for capturing more complex, semantically complete spans of text. We introduce several downstream tasks that leverage sentence representations in conjunction with finer-grained units in a cross-lingual text mining setting. We experimentally show how leveraging sentence-level data for cross-lingual embeddings can be used to identify cross-lingual document pairs and parallel sentences, the data necessary for training machine translation models.
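As a rough illustration of what mining "frequent significant contiguous text patterns" can look like, the sketch below counts contiguous word n-grams, keeps those that meet a minimum support, and scores each frequent bigram by how far its count exceeds what independent words would produce. This is not the dissertation's algorithm: the function names, the whitespace tokenization, the toy corpus, and the particular significance score are all assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def frequent_ngrams(docs, max_len=3, min_support=2):
    """Count contiguous word n-grams; keep those seen at least min_support times."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()  # assumption: plain whitespace tokenization
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_support}


def bigram_significance(counts, total_tokens):
    """Score frequent bigrams against an independence baseline.

    Hypothetical score: (observed - expected) / sqrt(observed), where
    expected = count(left) * count(right) / total_tokens.
    """
    scores = {}
    for gram, observed in counts.items():
        if len(gram) != 2:
            continue
        left = counts.get(gram[:1], 0)
        right = counts.get(gram[1:], 0)
        expected = left * right / total_tokens
        scores[gram] = (observed - expected) / sqrt(observed)
    return scores


# Toy corpus, invented for this example.
docs = [
    "support vector machine training",
    "a support vector machine classifier",
    "vector space model",
    "machine learning with a support vector machine",
]
total = sum(len(d.split()) for d in docs)
frequent = frequent_ngrams(docs)
scores = bigram_significance(frequent, total)

# Print frequent bigrams from most to least significant.
for gram, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(gram), round(score, 2))
```

A bigram with a high score co-occurs far more often than its parts would by chance, which makes it a candidate for treatment as a single phrase unit before a downstream step such as topic modeling.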

Graduation Semester

2020-05

Type of Resource

Thesis

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Computer Science

Dissertations and Theses from the Dept. of Computer Science
