Domain-agnostic named entity recognition on unstructured text
Arora, Jatin
Permalink
https://hdl.handle.net/2142/110555
Description
- Title
- Domain-agnostic named entity recognition on unstructured text
- Author(s)
- Arora, Jatin
- Issue Date
- 2021-04-26
- Director of Research (if dissertation) or Advisor (if thesis)
- Han, Jiawei
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- named entity recognition
- ner
- information extraction
- knowledge extraction
- deep learning
- bert
- biomedical entity extraction
- general domain ner
- entity chunking
- entity identification
- word patterns
- sequence labeling
- question answering
- span detection
- phrase detection
- entity typing
- phrase classification
- span classification
- mention detection
- natural language processing
- nlp
- text mining
- data mining
- machine learning
- neural networks
- Abstract
- Named Entity Recognition (NER) is the task of extracting informative entities belonging to predefined semantic classes from raw text. These semantic classes can be general-purpose, such as person or location, or domain-specific, such as gene or protein names in biomedical texts. NER has widespread applications in natural language processing (NLP) and serves as the foundation for applications such as question answering, information retrieval, and machine translation. Recently, the NER task has gained considerable traction in the research community with the advent of deep learning models like BERT, which capture textual semantics very well. In this work, we present a detailed study approaching the NER task from three different perspectives, namely sequence labeling, question answering (QA), and span-based classification. We propose a simple span detection and classification pipeline that first detects all mention spans irrespective of entity type, then feeds each mention span as input to a model that outputs its entity type. This setup is the reverse of a traditional QA-based NER system, where the entity type is the input and the mention spans are the output. We also introduce explicit pattern embeddings, which complement character embeddings to learn better word representations even with less training data. Experimental results demonstrate the effectiveness of our proposed domain-agnostic techniques on multiple datasets. We set a new state of the art on BioNLP13CG and achieve competitive performance on the CoNLL 2003 and JNLPBA datasets. Additionally, we probe the BERT model and show that mere concatenation of external feature vectors with BERT outputs may not train effectively at the low learning rates recommended for BERT; more sophisticated feature fusion is essential.
- Graduation Semester
- 2021-05
- Type of Resource
- Thesis
- Permalink
- http://hdl.handle.net/2142/110555
- Copyright and License Information
- Copyright 2021 Jatin Arora
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)
Graduate Theses and Dissertations at Illinois
Dissertations and Theses - Computer Science
Dissertations and Theses from the Dept. of Computer Science