Scientific knowledge extraction from massive text data
Wang, Xuan
Permalink
https://hdl.handle.net/2142/117759
Description
- Title
- Scientific knowledge extraction from massive text data
- Author(s)
- Wang, Xuan
- Issue Date
- 2022-11-16
- Director of Research (if dissertation) or Advisor (if thesis)
- Han, Jiawei
- Doctoral Committee Chair(s)
- Han, Jiawei
- Committee Member(s)
- Ji, Heng
- Zhai, Chengxiang
- Lu, Zhiyong
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- scientific text mining
- information extraction
- weak supervision
- distant supervision
- multi-modal supervision
- fine-grained named entity recognition
- textual evidence mining
- scientific topic contrasting
- open information extraction
- knowledge graph construction
- literature search
- Abstract
- Text mining is promising for advancing human knowledge in many fields, given the rapidly growing volume of text data (e.g., news reports, scientific articles, and medical notes) now available. Recently, there has been growing interest in bringing text mining to scientific discovery in various domains, such as mining the biomedical literature and electronic health records for health care and biomedicine, mining the chemistry literature for molecular discovery and synthetic strategy design, and mining the agriculture literature for agricultural resilience, management, and sustainability. We envision tremendous opportunities in this emerging area of advanced text mining for scientific discovery. This thesis focuses on developing effective and scalable text mining algorithms and systems to enable and accelerate scientific discovery. We primarily focus on two research directions: (1) scientific information extraction with weak supervision, and (2) scientific knowledge discovery applications. • Scientific Information Extraction with Weak Supervision: With the growing volume of text data and the breadth of information, it is inefficient or nearly impossible for humans to manually find, integrate, and digest useful information. A major challenge is to develop methods that automatically understand massive unstructured text data. To address this challenge, we have developed methods that extract information from text with minimal human supervision. We have contributed a series of algorithms and systems under three weak supervision scenarios: (1) pattern-enhanced weak supervision for scientific information extraction, (2) ontology-guided distant supervision for fine-grained information extraction, and (3) cross-modal supervision between text and graph. • Scientific Knowledge Discovery in the Real World: With the advanced text mining methods developed, we further study how to enable and accelerate real-world knowledge discovery.
We have been collaborating with experts in various science domains (e.g., biomedicine, chemistry, and health) to achieve this goal. Through the collaborations, we have developed algorithms and systems for two real-world applications: (1) scientific textual evidence discovery and (2) scientific topic contrasting. Our research benefits from and fosters collaborations with experts in various research areas within and beyond computer science from various institutions, including hospitals (UC Davis Medical Center), government (National Institutes of Health and Army Research Lab), industry (IBM and Eli Lilly), and academics from other universities (Stanford, UCLA, UC Davis, UCSD, USC, Purdue, and Iowa State University). Our algorithms and systems can be used in any science domain where knowledge discovery from massive text data is needed. Two examples in the health and chemistry domains are discussed below. • Clinical Domain: We have developed text mining methods to find proteins that are specifically associated with six main categories of heart diseases. Our top-ranked proteins match the knowledge of the clinical researchers very well. Some of our discovered proteins are currently under experimental validation by clinical researchers at the UC Davis Medical Center. This collaboration has a high potential to unveil novel therapeutic targets in patients and repurpose drugs already used in the clinic. • Chemistry Domain: We have also developed text mining methods to support an intelligent molecule discovery process in organic chemistry. We have been collaborating with researchers in the Chemistry Department at UIUC, finding the most representative catalysts and reaction conditions by comparing different organic reaction types. This collaboration leads to AI-driven systems for automatic chemical/material synthesis plan generation and optimization.
In summary, we tackle a series of technical challenges for automatically extracting a wide range of information from unstructured scientific text. We further address open scientific problems, such as clinical drug discovery and chemical and biological molecule design, based on the rich information we automatically extracted from the scientific text. However, there remain grand challenges for scientific text mining, such as a lack of specialized domain knowledge in a natural language context, multi-modal representations of scientific knowledge, and complex conditions associated with scientific information. In the future, we plan to tackle the above challenges by developing knowledge-enhanced, multi-modal, and condition-aware text mining approaches for scientific discovery.
- Graduation Semester
- 2022-12
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2022 Xuan Wang
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois