Text cube: construction, summarization and mining

Tao, Fangbo

Permalink

https://hdl.handle.net/2142/99244

Description

Title

Text cube: construction, summarization and mining

Author(s)

Tao, Fangbo

Issue Date

2017-12-06

Director of Research (if dissertation) or Advisor (if thesis)

Han, Jiawei

Doctoral Committee Chair(s)

Han, Jiawei

Committee Member(s)

Zhai, ChengXiang
Peng, Jian
Wang, Haixun

Department of Study

Computer Science

Discipline

Computer Science

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Keyword(s)

Text cube
Data cube
Data mining
Natural language processing
Text classification

Abstract

A large portion of real-world data is either text or structured (e.g., relational) data. Such data objects are often linked together (e.g., structured product information linked with product descriptions and customer reviews). To systematically analyze large numbers of such textual documents, it is often desirable to manage the text data together with the associated structured data in a multi-dimensional space (hence "text cube"). This thesis studies the multi-dimensional representation of large textual data. Since Jim Gray introduced the concept of the "data cube", the data cube, together with online analytical processing (OLAP), has become a driving engine of the data warehouse industry. By modeling a large textual corpus as a "cube", i.e., a multi-dimensional and hierarchical structure, we bridge the power of traditional OLAP with Information Retrieval and Natural Language Processing techniques. In particular, this thesis focuses on two lines of work: one is to construct a multi-dimensional text cube from raw text data with limited user guidance; the other is to develop effective summarization and mining techniques tailored for multi-dimensional queries on text cubes.
In the first part of the thesis, the problem of dimension-based structure creation is studied. We propose an end-to-end framework for extracting multi-dimensional structure from a corpus, taking a domain-specific corpus and a limited set of seeds as input and generating high-quality dimension values as output. We introduce the novel concept of the Semantic Pattern Graph, which leverages web signals to understand the underlying semantics of lexical patterns, improves pattern evaluation using the mined semantics, and yields more accurate and complete structure. Experiments show the effectiveness of our approach.
In the second part, with all the dimensions discovered, we study the problem of cell-based document allocation, that is, linking the created dimensions with the text data to construct a multi-dimensional text cube by allocating each document to the correct multi-dimensional subset, i.e., a cell. Traditional approaches to this task may require substantial labeling from the user. Instead, we propose a model that requires no additional training data besides the given (label) name of each cube dimension as weak supervision. With such weak supervision, we develop a dimension-aware joint embedding framework that learns joint representations for terms, documents, and labels. In the joint embedding process, our method iteratively learns dimension-aware document representations by selectively focusing on discriminative keywords for different dimensions. Furthermore, it alleviates label sparsity by leveraging label representations to enrich the labeled term set. Numerical experiments corroborate the effectiveness of our solution.
In the third part, we introduce the concept of Context-Aware Semantic Online Analytical Processing (i.e., CASeOLAP) in text cubes, and use top-k representative phrases to represent the semantics of the document subset in a text cube cell. By ranking phrases with a newly proposed measure according to three criteria (integrity, popularity, and distinctiveness), we identify phrases that successfully digest the main content of a document subset of interest and contrast it with neighboring subsets. Our experiments on a large news dataset demonstrate the effectiveness of the proposed ranking measure in finding representative phrases, as well as its efficiency in both query processing time and storage cost. The approach has also been applied successfully to clinical biomarker analysis and protest news analysis.
In the last part, the EventCube system is proposed to support the end-to-end text cube pipeline in an informative, interactive, and user-friendly manner. The system serves as a general platform for construction, search, summarization, OLAP, and data mining on integrated text and structured data. It is a growing testbed for text-cube-based research and has been successfully applied to aviation safety report analysis for NASA and to counter-terrorism report analysis for the Army Research Lab. To summarize, this thesis provides important results on the construction and consumption of multi-dimensional text cubes and shows their power in tackling real-world text analysis tasks.
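
As a rough illustration of the text cube idea described in the abstract, the sketch below groups documents into cells by their dimension values and then ranks the terms of one cell. It is a minimal sketch under stated assumptions, not the method of the thesis: the toy corpus, the dimension names, and the functions `build_cube` and `top_k_terms` are all hypothetical, single words stand in for phrases, the integrity criterion is omitted, and the score (in-cell count times the share of a term's corpus occurrences that fall in the cell) is only a simple stand-in for popularity and distinctiveness.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: each document has structured dimension values plus text.
DOCS = [
    {"topic": "economy", "location": "US", "text": "stock market rally lifts tech shares"},
    {"topic": "economy", "location": "China", "text": "export growth slows as trade tension rises"},
    {"topic": "sports", "location": "US", "text": "home team wins the championship game"},
    {"topic": "economy", "location": "US", "text": "stock market dips on rate fears"},
]
DIMENSIONS = ("topic", "location")  # assumed dimensions, for illustration only


def build_cube(docs, dimensions):
    """Group document texts into cells keyed by their tuple of dimension values."""
    cube = defaultdict(list)
    for doc in docs:
        cell = tuple(doc[d] for d in dimensions)
        cube[cell].append(doc["text"])
    return dict(cube)


def top_k_terms(cube, cell, k=3):
    """Rank a cell's terms by in-cell count times the in-cell share of corpus occurrences."""
    inside = Counter(w for text in cube[cell] for w in text.split())
    overall = Counter(w for texts in cube.values() for text in texts for w in text.split())
    # Stand-in score: popularity (count in the cell) weighted by a crude
    # distinctiveness ratio (fraction of all occurrences that lie in the cell).
    scored = {w: count * (count / overall[w]) for w, count in inside.items()}
    # Break ties alphabetically so the ranking is deterministic.
    return sorted(scored, key=lambda w: (-scored[w], w))[:k]


cube = build_cube(DOCS, DIMENSIONS)
print(sorted(cube))                               # the non-empty cells
print(top_k_terms(cube, ("economy", "US"), k=2))  # top terms for one cell
```

Querying the cell `("economy", "US")` mimics a multi-dimensional query on the cube; comparing against the whole corpus rather than only neighboring cells is a simplification made here for brevity.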

Graduation Semester

2017-12

Type of Resource

text

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Computer Science

Dissertations and Theses from the Dept. of Computer Science
