Automatic rare disease extraction based on large language models
Cao, Lang
Permalink
https://hdl.handle.net/2142/124192
Description
- Title
- Automatic rare disease extraction based on large language models
- Author(s)
- Cao, Lang
- Issue Date
- 2024-04-05
- Director of Research (if dissertation) or Advisor (if thesis)
- Sun, Jimeng
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- AI for Healthcare
- Large Language Model
- Natural Language Processing
- Abstract
- Identifying and extracting information on rare diseases is crucial in many medical contexts, yet mature rare disease extraction methods are lacking in low-resource settings. In this paper, we present AutoRD, an end-to-end system that automates the extraction of rare disease information from clinical text. AutoRD combines large language models (LLMs) with medical knowledge graphs built from open-source medical ontologies: the LLMs aid in language analysis, while the knowledge graphs provide content-specific facts that fill information gaps. The system is a software pipeline comprising data preprocessing, entity extraction, relation extraction, entity calibration, and knowledge graph construction, with LLMs and open-source data leveraged throughout. We conducted various tests to evaluate AutoRD and highlight its strengths and limitations. We quantitatively evaluate the system in terms of entity extraction, relation extraction, and knowledge graph construction. AutoRD achieves an overall F1 score of 47.3%, an improvement of 0.8% over the fine-tuned model and of 14.4% over the base LLM. In detail, AutoRD achieves an overall entity extraction F1 score of 56.1% (rare_disease: 83.5%, disease: 35.8%, symptom_and_sign: 46.1%, anaphor: 67.5%) and an overall relation extraction F1 score of 38.6% (produces: 34.7%, increases_risk_of: 12.4%, is_a: 37.4%, is_acronym: 44.1%, is_synonym: 16.3%, anaphora: 57.5%). Our qualitative experiment also indicates that AutoRD performs well at knowledge graph construction. Several design choices, including ontology-enhanced LLMs, contribute to these improvements. AutoRD outperforms the other methods compared, demonstrating the potential of LLM applications in rare disease detection and AI for healthcare.
- Graduation Semester
- 2024-05
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2024 Lang Cao
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois