Towards fine-grained automated age extraction for precision medicine
Salvi, Rohan Charudatt
This item's files can only be accessed by the Administrator group.
Permalink
https://hdl.handle.net/2142/121277
Description
- Title
- Towards fine-grained automated age extraction for precision medicine
- Author(s)
- Salvi, Rohan Charudatt
- Issue Date
- 2023-07-18
- Director of Research (if dissertation) or Advisor (if thesis)
- Blake, Catherine
- Committee Member(s)
- Bosch, Nigel
- Department of Study
- Information Sciences
- Discipline
- Information Management
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Natural Language Processing
- Precision Medicine
- Information Extraction
- Abstract
- Precision medicine (PM) aims to customize interventions for individuals. Although genetic factors have received significant attention, non-genetic population characteristics such as age, race, and gender remain underutilized because it is difficult to extract them from the literature at the right level of specificity. Our goal is to automatically identify age, which authors report either narratively (e.g., adult, child) or numerically; numeric ages are further delineated into mean, median, minimum and maximum (e.g., 20 and 60, respectively, from the text "aged 20-60"), standard deviation, and upper and lower bounds (e.g., >60, less than 50). To address this problem, we conducted two experiments. In the first experiment, we used three information extraction methods (rule-based, extractive question answering, and conditional generation) to identify age at the level of detail needed by PM. We then conducted experiments using an existing evidence-based medicine natural language processing (EBM-NLP) dataset, which we augmented to include the extra detail, and introduced a new dataset focused on breast cancer, where the age at diagnosis significantly impacts intervention possibilities. Extractive question answering consistently outperformed the other techniques, achieving the highest F1-scores of 0.988, 0.961, 0.991, 0.994, and 0.667 for mean, median, minimum, maximum, and standard deviation, respectively, on the breast cancer dataset. Our analysis also revealed 200 missing values in EBM-NLP and showed that training a model for each facet relies heavily on a substantial amount of training data and computational resources. The second experiment addresses this limitation by exploring active learning for information extraction. We curated a new breast cancer dataset and conducted a comparative study of conditional random field (CRF) and bidirectional long short-term memory-conditional random field (Bi-LSTM-CRF) models and of sampling techniques.
Our findings reveal that the Bi-LSTM-CRF model, coupled with random sampling, outperforms the other approaches, achieving F1-scores of 0.653, 0.719, 0.924, and 0.977 for mean, median, minimum, and maximum, respectively. The active learning approach did not work for standard deviation. Overall, these results demonstrate that automated methods can identify age and that the latter approach is promising for identifying other non-genetic factors for precision medicine. This thesis contributes to the fields of precision medicine and information science by providing publicly available datasets, including a new breast cancer dataset and a revised EBM-NLP dataset that includes the level of specificity required by PM, to enable others to hone their automated methods.
- Graduation Semester
- 2023-08
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2023 Rohan Charudatt Salvi
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois