Discovering latent topical phrases in document collections and networks with text components: Leveraging text mining and information network analysis for human oriented applications
Danilevsky, Marina
Permalink
https://hdl.handle.net/2142/49415
Description
- Title
- Discovering latent topical phrases in document collections and networks with text components: Leveraging text mining and information network analysis for human oriented applications
- Author(s)
- Danilevsky, Marina
- Issue Date
- 2014-05-30T16:42:47Z
- Director of Research (if dissertation) or Advisor (if thesis)
- Han, Jiawei
- Doctoral Committee Chair(s)
- Han, Jiawei
- Committee Member(s)
- Zhai, ChengXiang
- Hockenmaier, Julia C.
- Koh, Eunyee
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- topical phrases
- topical hierarchy
- mining topical keyphrases
- topical community discovery
- Abstract
- One of the major challenges of mining topics from a large corpus is the quality of the constructed topics. While phrase-generating approaches generally produce high-quality output, they do not scale well with the size of the data. Thus, state-of-the-art solutions usually either rely on scalable unigram-generating methods, which do not produce high-quality human-readable topics, or are forced to use external knowledge bases. Furthermore, while document collections naturally contain topics at different levels of granularity (general vs. specific), very few traditional methods focus on generating high-quality hierarchical topic structures. This dissertation presents a series of approaches that directly address these challenges of generating high-quality phrase-based topics, both as a flat set and organized as a hierarchy, as well as some potential applications. First, we describe a framework that generates high-quality topics represented by integrated lists of mixed-length phrases. The key is adopting a phrase-centric view of the construction and ranking of topical phrases. The approach is domain-independent and requires neither expert supervision nor an external knowledge base. The framework is initially constructed to work on collections of short texts, such as titles of scientific documents. We then show how the framework can be easily and robustly extended to work on collections of longer texts, and demonstrate its applicability to human needs with a task-centric evaluation. The dissertation then addresses the need to move beyond generating a flat set of topics, and presents an approach to constructing hierarchical topics, which extends the phrase-centric approach to create high-quality phrases at varying levels of granularity. Another application of this technique is then presented: the task of entity role discovery.
By tying entities in a community to topical phrases, users are able to explicitly understand both how and why individual entities are ranked within a specific community. A final extension is then described: a combined approach for constructing the hierarchy that uses entity link information to improve the quality of the hierarchy.
- Graduation Semester
- 2014-05
- Permalink
- http://hdl.handle.net/2142/49415
- Copyright and License Information
- Copyright 2014 Marina Danilevsky
Owning Collections
Graduate Dissertations and Theses at Illinois (PRIMARY)
Graduate Theses and Dissertations at Illinois
Dissertations and Theses - Computer Science
Dissertations and Theses from the Dept. of Computer Science