Annotation-free knowledge mining from massive text corpora
Gu, Xiaotao
Permalink
https://hdl.handle.net/2142/115488
Description
- Title
- Annotation-free knowledge mining from massive text corpora
- Author(s)
- Gu, Xiaotao
- Issue Date
- 2022-04-19
- Director of Research (if dissertation) or Advisor (if thesis)
- Han, Jiawei
- Doctoral Committee Chair(s)
- Han, Jiawei
- Committee Member(s)
- Ji, Heng
- Abdelzaher, Tarek
- Yu, Cong
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- text mining
- knowledge mining
- Abstract
- Recent years have witnessed an unprecedented information explosion. Text data, as a major carrier of information and knowledge, is emerging at a blazing speed thanks to the development of the Internet. Despite its great value, the overwhelming amount of text data brings great challenges, both for human readers to consume and for machines to process. To fully unleash the power of text data, our goal is to mine, organize, and summarize structured knowledge from massive text corpora in an intelligent and effort-light manner. Existing models for text mining lean heavily on excessive human annotation or manually curated knowledge bases, which are not only expensive but also hard to transfer to new domains. In this work, we aim to alleviate the need for human annotation by directly mining high-quality supervision signals from the input corpora for model learning. We show that, through self-supervised tasks on the corpora, models can effectively capture rich patterns in text and learn to extract information and generate text guided by such patterns as silver labels. In this thesis, we outline a general framework to mine a pyramid of knowledge at different granularities to satisfy various information needs. We explore the possibility of annotation-free knowledge mining on phrase-level, sentence-level, document-level, and multi-document-level tasks:
- 1. Automated Phrase Mining. Phrases are arguably the basic semantic units for text understanding. We start by introducing UCPhrase, an unsupervised context-aware phrase tagging model that does not rely on any human annotation or knowledge bases. We show that silver labels mined from unlabeled corpora can replace, and even outperform, labels distantly fetched from existing knowledge bases. We further propose to leverage attention maps generated by pre-trained language models to extract informative features about sentence structure, which alleviate frequency bias and effectively capture emerging infrequent phrases.
- 2. Phrase-aware Sentence Parsing. With knowledge of phrases, we further study how phrases are connected to form the structure of a sentence. We model sentence parsing as a three-stage process: (1) extract obvious phrases (e.g., entities, names, concepts) in the sentence as prior knowledge; (2) learn to identify more local phrases with guidance from the extracted phrases; (3) learn to connect local phrases into high-level structures. We treat randomly masked tokens in sentences as silver labels and incorporate phrase information into structured language models (LMs) for unsupervised constituency parsing. The proposed phrase-regularized warm-up and phrase-aware masked language modeling improve both local and high-level structure parsing, and establish a new state of the art for LM-based constituency parsing.
- 3. Representative Headline Generation. Beyond syntactic knowledge extraction, we generalize the idea of annotation-free mining to automatically organize and summarize documents based on semantic knowledge. Motivated by news readers' real need to efficiently consume the overwhelming volume of daily news, we develop NHNet, a self-supervised model that generates concise, high-quality headlines for news stories. Without human annotation, we propose a three-level pre-training framework to fully leverage silver labels from web-scale news data for model training. Our model outperforms a supervised generation model trained on human labels collected over years, and performs even better after fine-tuning on a small amount of human labels.
- 4. Information-guided Document Summarization. We then demonstrate how to incorporate phrase- and sentence-level knowledge into self-supervised generation models for more accurate and interpretable document summarization. We propose EASum, a two-stage framework for unsupervised abstractive summarization. We train an information-guided generation model to produce fluent sentences covering specified phrases and events, using silver labels mined from unlabeled documents. The model then generates a summary for each document from key phrases and events extracted from the target document. We show that key phrases and parsing-based events improve the accuracy of generated summaries and provide a clear trace back to the source of each generated summary.
- Together, the developed methods form a powerful annotation-free framework for multi-level text mining. All models mentioned above are open-sourced for public use, and some have been deployed in real-world production systems.
- Graduation Semester
- 2022-05
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2022 Xiaotao Gu
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY