Withdraw
Loading…
Constructing and modeling text-rich information networks: a phrase mining-based approach
Liu, Jialu
Loading…
Permalink
https://hdl.handle.net/2142/92777
Description
- Title
- Constructing and modeling text-rich information networks: a phrase mining-based approach
- Author(s)
- Liu, Jialu
- Issue Date
- 2016-07-11
- Director of Research (if dissertation) or Advisor (if thesis)
- Han, Jiawei
- Doctoral Committee Chair(s)
- Han, Jiawei
- Committee Member(s)
- Zhai, ChengXiang
- Parameswaran, Aditya
- Yu, Cong
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Text-Rich Information Networks
- Phrase Mining
- Heterogeneous Information Network
- Network Embedding
- Keyphrase Extraction
- Abstract
- "A lot of digital ink has been spilled on ""big data"" over the past few years, which is often characterized by an explosion of information. Most of this surge owes its origin to the unstructured data in the wild like words, images and video as comparing to the structured information stored in fielded form in databases. The proliferation of text-heavy data is particularly overwhelming, reflected in everyone's daily life in forms of web documents, business reviews, news, social posts, etc. In the mean time, textual data and structured entities often come in intertwined, such as authors/posters, document categories and tags, and document-associated geo locations. With this background, a core research challenge presents itself as how to turn massive, (semi-)unstructured data into structured knowledge. One promising paradigm studied in this dissertation is to integrate structured and unstructured data, constructing an organized heterogeneous information network, and developing powerful modeling mechanisms on such organized network. We name it text-rich information network, since it is an integrated representation of both structured and unstructured textual data. To thoroughly develop the construction and modeling paradigm, this dissertation will focus on forming a scalable data-driven framework and propose a new line of techniques relying on the idea of phrase mining to bridge textual documents and structured entities. We will first introduce the phrase mining method named SegPhrase+ to globally discover semantically meaningful phrases from massive textual data, providing a high quality dictionary for text structuralization. Clearly distinct from previous works that mostly focused on raw statistics of string matching, SegPhrase+ looks into the phrase context and effectively rectifies raw statistics to significantly boost the performance. Next, a novel algorithm based on latent keyphrases is developed and adopted to largely eliminate irregularities in massive text via providing an consistent and interpretable document representation. As a critical process in constructing the network, it uses the quality phrases generated in the previous step as candidates. From them a set of keyphrases are extracted to represent a particular document with inferred strength through a statistical model. After this step, documents become more structured and are consistently represented in the form of a bipartite network connecting documents with quality keyphrases. A more heterogeneous text-rich information network can be constructed by incorporating different types of document-associated entities as additional nodes. Lastly, a general and scalable framework, Tensor2vec, are to be added to trational data minining machanism, as the latter cannot readily solve the problem when the organized heterogeneous network has nodes with different types. Tensor2vec is expected to elegantly handle relevance search, entity classification, summarization and recommendation problems, by making use of higher-order link information and projecting multi-typed nodes into a shared low-dimensional vectorial space such that node proximity can be easily computed and accurately predicted."
- Graduation Semester
- 2016-08
- Type of Resource
- text
- Permalink
- http://hdl.handle.net/2142/92777
- Copyright and License Information
- Copyright 2016 Jialu Liu
Owning Collections
Dissertations and Theses - Computer Science
Dissertations and Theses from the Dept. of Computer ScienceGraduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at IllinoisManage Files
Loading…
Edit Collection Membership
Loading…
Edit Metadata
Loading…
Edit Properties
Loading…
Embargoes
Loading…