NumNER: Numerical named entity recognition in scientific literature
Garg, Shweta
Permalink
https://hdl.handle.net/2142/117821
Description
- Title
- NumNER: Numerical named entity recognition in scientific literature
- Author(s)
- Garg, Shweta
- Issue Date
- 2022-12-05
- Director of Research (if dissertation) or Advisor (if thesis)
- Han, Jiawei
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Named Entity Recognition
- NER
- Numerical Entities
- Numerical NER
- Quantities
- Units of Measurements
- SciBERT
- Entity Disambiguation
- Abstract
- In the scientific domain, scientists deal with many numerical entities every day. For example, in chemistry, it is common to describe an inorganic reaction with details such as the weight or volume of the compounds, the temperature at which the reaction was carried out, and the time it took to complete. Identifying such numerical entities is therefore useful, but the data is not always available in an unambiguous format. For example, the numerical entity 10 m can have several meanings depending on the context in which it is used: a molarity entity denoting 10 moles of a compound, a length entity denoting 10 meters, or an abbreviation of a time entity, i.e., 10 minutes. Identifying and resolving such ambiguities is therefore a non-trivial task. We propose NumNER, a weakly supervised numerical Named Entity Recognition (NER) model that takes as input weak supervision in the form of ontology- and knowledge-base-guided dictionaries of unit labels and unit prefixes, and performs numerical entity identification and classification on scientific corpora using a combination of rule-based and context-based methods. The model also performs entity disambiguation by leveraging SciBERT, a language model pre-trained on a massive scientific corpus, to understand the context of the words surrounding an ambiguous numerical entity. The model generalizes to any domain-specific units, provided the user supplies the labels and symbols for those units as weak supervision. We compare the performance of our model with popular baseline NER methods and analyze the results both quantitatively and qualitatively. We also show a real-world use case of the model: extracting numerical attributes for a patient from a clinical report.
- Graduation Semester
- 2022-12
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2022 Shweta Garg
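The abstract's "10 m" example can be illustrated with a minimal sketch. This is not the thesis's implementation: the unit dictionary, the cue-word sets, and the function name `extract_numerical_entities` are all hypothetical, and a simple cue-word count stands in for the SciBERT-based context model the abstract describes. It only shows the general shape of combining a dictionary-driven rule with context-based disambiguation.

```python
import re

# Hypothetical weak supervision: unit symbol -> candidate entity types.
UNIT_TYPES = {
    "m": ["length", "molarity", "time"],  # ambiguous, as in the abstract
    "g": ["weight"],
    "mL": ["volume"],
    "h": ["time"],
}

# Hypothetical cue words per entity type. This is a stand-in for a
# language model's reading of context, not how SciBERT works.
CONTEXT_CUES = {
    "length": {"long", "distance", "height", "wide"},
    "molarity": {"solution", "concentration", "dissolved"},
    "time": {"stirred", "heated", "after", "for"},
}

# Rule-based pattern: a number followed by a run of letters (the unit).
NUMBER_UNIT = re.compile(r"(\d+(?:\.\d+)?)\s*([A-Za-z]+)\b")


def extract_numerical_entities(text):
    """Return (value, unit, label) tuples for known units found in text."""
    context = set(re.findall(r"[a-z]+", text.lower()))
    results = []
    for match in NUMBER_UNIT.finditer(text):
        value, unit = match.group(1), match.group(2)
        candidates = UNIT_TYPES.get(unit)  # case-sensitive lookup
        if not candidates:
            continue  # unit not in the dictionary: skip it
        if len(candidates) == 1:
            label = candidates[0]
        else:
            # Pick the candidate with the most cue words in the text.
            # On a tie, max() keeps the first candidate in list order.
            label = max(
                candidates,
                key=lambda c: len(CONTEXT_CUES.get(c, set()) & context),
            )
        results.append((float(value), unit, label))
    return results


if __name__ == "__main__":
    print(extract_numerical_entities(
        "The mixture was stirred for 10 m and then 5 g of salt was added."
    ))
    print(extract_numerical_entities("The rod is 10 m long."))
```

In this sketch the same surface form "10 m" is labeled as a time entity in the first sentence and as a length entity in the second, purely because of the surrounding words. Supporting a new domain would amount to extending the two dictionaries, which loosely mirrors the abstract's point about user-provided unit labels and symbols.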
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois