Microbial named entity recognition using BERT models
Rao, Brian K
Permalink
https://hdl.handle.net/2142/115955
Description
- Title
- Microbial named entity recognition using BERT models
- Author(s)
- Rao, Brian K
- Issue Date
- 2022-07-21
- Director of Research (if dissertation) or Advisor (if thesis)
- Kilicoglu, Halil
- Department of Study
- Information Sciences
- Discipline
- Bioinformatics
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Bioinformatics
- Abstract
- Bacteria are critical subjects of microbiological research that spans many rapidly growing fields of study. The development and widespread application of high-throughput sequencing has led to more microbial data being collected in recent years than ever before. This study investigates the capabilities of the popular Natural Language Processing (NLP) model Bidirectional Encoder Representations from Transformers (BERT) in the relatively understudied text mining domain of microbiology. A variety of BERT models (BERT, DistilBERT, SciBERT, BioBERT, PubMedBERT) were fine-tuned on the Bacteria Biotope 2019 Open Shared Task (BB2019-OST) corpus of annotated microbial research text, and the best-performing models were evaluated on the Named Entity Recognition (NER) task. An in-depth error analysis was then conducted to gain insight into BERT’s entity recognition capabilities. Finally, the learning rate and batch size hyperparameters were tuned to increase the F1-score. The best BERT model in the comparison was BioBERT, which achieved an F1-score of 73.82 (±1.04) with default hyperparameters and 75.35 (±0.62) with tuned hyperparameters. BioBERT had better F1-scores and entity-level statistics than PubMedBERT, even though PubMedBERT ranks highly in biomedical NLP benchmarks. This suggests that the generality of the pretraining corpora of BERT models is particularly important for text mining in the microbial domain.
- Graduation Semester
- 2022-08
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2022 Brian Rao
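The abstract reports F1-scores for the NER task. As a rough illustration of how an entity-level F1-score can be computed, the sketch below assumes BIO-tagged sequences and strict matching (a predicted entity counts only if its span and label both match the gold annotation exactly). The tagging scheme, the matching criterion, the function names `extract_entities` and `entity_f1`, and the example labels are all assumptions made for illustration; the record does not state how the thesis computed its scores.

```python
def extract_entities(tags):
    """Return the set of (start, end, label) spans in one BIO tag sequence.

    `end` is exclusive. Stray I- tags with no preceding B- tag are ignored.
    """
    spans = set()
    start, label = None, None
    # A trailing "O" sentinel closes any entity that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def entity_f1(gold_sentences, pred_sentences):
    """Micro-averaged, strict-match entity-level F1 over a list of sentences."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans = extract_entities(gold)
        pred_spans = extract_entities(pred)
        tp += len(gold_spans & pred_spans)   # exact span and label match
        fp += len(pred_spans - gold_spans)   # predicted but not in gold
        fn += len(gold_spans - pred_spans)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative labels only; one gold entity is missed by the prediction.
gold = [["B-Microorganism", "I-Microorganism", "O", "B-Habitat"]]
pred = [["B-Microorganism", "I-Microorganism", "O", "O"]]
print(round(entity_f1(gold, pred), 4))
```

In this toy case precision is 1.0 and recall is 0.5, so the F1-score is their harmonic mean, 2/3. Scores such as those in the abstract are conventionally this value multiplied by 100.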
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois