Character language models for generalization of multilingual named entity recognition
Yu, Xiaodong
Permalink
https://hdl.handle.net/2142/104934
Description
- Title
- Character language models for generalization of multilingual named entity recognition
- Author(s)
- Yu, Xiaodong
- Issue Date
- 2019-04-25
- Director of Research (if dissertation) or Advisor (if thesis)
- Roth, Dan
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Date of Ingest
- 2019-08-23T20:02:10Z
- Keyword(s)
- Character Language Models
- Named Entity Recognition
- Generalization
- Multilingual
- Multilingual Named Entity Recognition
- NER
- Abstract
- State-of-the-art Named Entity Recognition (NER) models usually achieve high performance on entities that they have seen in training data, but significantly lower performance on unseen entities. This is one of the key reasons for the performance degradation observed when NER models are evaluated on new domains. Motivated by this observation, quantified for the first time in this thesis, we study an improved, multi-domain and multilingual capability for identifying "what is a name". Character-level patterns have been widely used as features in English NER systems. However, to date there has been no direct investigation of the inherent differences between name and non-name tokens in text, nor of whether this property holds across multiple languages. The key contribution of this thesis is to develop a Character-level Language Model (CLM) that, as we show, allows us to better learn "what is a name". We analyze the capabilities of corpus-agnostic CLMs in the binary task of distinguishing name tokens from non-name tokens and demonstrate that CLMs provide a simple yet powerful model for capturing these differences. Specifically, we show that they can identify named entity tokens in a diverse set of languages at close to the performance of full NER systems. Moreover, by adding very simple CLM-based features we can significantly improve the performance of an off-the-shelf NER system for multiple languages.
- Graduation Semester
- 2019-05
- Type of Resource
- text
- Permalink
- http://hdl.handle.net/2142/104934
- Copyright and License Information
- Copyright 2019 Xiaodong Yu
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois
Dissertations and Theses - Computer Science
Dissertations and Theses from the Siebel School of Computer Science