Cross-lingual entity extraction and linking for 300 languages
Pan, Xiaoman
Permalink
https://hdl.handle.net/2142/109431
Description
- Title
- Cross-lingual entity extraction and linking for 300 languages
- Author(s)
- Pan, Xiaoman
- Issue Date
- 2020-12-03
- Director of Research (if dissertation) or Advisor (if thesis)
- Ji, Heng
- Doctoral Committee Chair(s)
- Ji, Heng
- Committee Member(s)
- Han, Jiawei
- Tong, Hanghang
- Knight, Kevin
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- cross-lingual
- entity extraction
- entity linking
- Abstract
- Information provided in languages that people can understand saves lives in crises. For example, the language barrier was one of the main difficulties faced by humanitarian workers responding to the Ebola crisis in 2014. We propose to break language barriers by extracting information (e.g., entities) from a massive variety of languages and grounding it in an existing Knowledge Base (KB) that is accessible to users in their own language (e.g., a reporter from the World Health Organization who speaks only English). The ambitious goal of this thesis is to develop a Cross-lingual Entity Extraction and Linking framework for 1,000 fine-grained entity types and the 300 languages that exist in Wikipedia. Given a document in any of these languages, our framework is able to identify entity name mentions, assign a fine-grained type to each mention, and link it to an English KB if it is linkable. Traditional entity linking methods rely on costly human-annotated data to train supervised learning-to-rank models that select the best candidate entity for each mention. In contrast, we propose a novel unsupervised represent-and-compare approach that accurately captures the semantic meaning of each mention and directly compares its representation with the representation of each candidate entity in the target KB. First, we leverage the deep symbolic semantic representation provided by Abstract Meaning Representation to represent contextual properties of mentions. Then we enrich the representation of each contextual word and entity mention with a novel distributed semantic representation based on cross-lingual joint entity and word embedding. We develop a novel method to generate cross-lingual data that is a mix of entities and contextual words based on Wikipedia. This distributed semantics enables effective entity extraction and linking. Because the joint entity and word embedding space is constructed across languages, we further extend it to all 300 Wikipedia languages and to fine-grained entity extraction and linking for 1,000 entity types defined in YAGO. Finally, using knowledge-driven question answering as a case study, we demonstrate the effectiveness of acquiring external knowledge through entity extraction and linking to improve downstream applications.
- Graduation Semester
- 2020-12
- Type of Resource
- Thesis
- Permalink
- http://hdl.handle.net/2142/109431
- Copyright and License Information
- Copyright 2020 Xiaoman Pan
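The represent-and-compare idea described in the abstract can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function names, the toy three-dimensional vectors, the entity identifiers, and the `nil_threshold` cutoff are all hypothetical, and cosine similarity is assumed here as the comparison function. The sketch only shows the general shape of comparing a mention vector against candidate entity vectors in a shared space and declining to link when no candidate is close enough.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def link_mention(mention_vec, candidates, nil_threshold=0.5):
    """Pick the candidate entity whose vector is most similar to the mention.

    `candidates` maps a (hypothetical) KB entity id to its embedding.
    Returns (entity_id, score), or (None, score) when the best score falls
    below `nil_threshold`, i.e. the mention is treated as unlinkable.
    The threshold value is an assumption made for illustration.
    """
    best_id, best_score = None, -1.0
    for entity_id, entity_vec in candidates.items():
        score = cosine(mention_vec, entity_vec)
        if score > best_score:
            best_id, best_score = entity_id, score
    if best_score < nil_threshold:
        return None, best_score
    return best_id, best_score


# Toy data: made-up vectors standing in for joint entity and word embeddings.
candidates = {
    "Ebola_virus_disease": [0.9, 0.1, 0.0],
    "Ebola_River": [0.1, 0.9, 0.2],
}
mention_vec = [0.8, 0.2, 0.1]  # hypothetical vector for a mention of "Ebola"

entity, score = link_mention(mention_vec, candidates)
unlinked, low_score = link_mention([0.0, 0.0, 1.0], candidates)
```

In this toy run the mention vector lies closest to the disease entity, while the second query is not close to either candidate and is left unlinked. No training labels are involved; the ranking comes entirely from the similarity of the representations.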
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois
Dissertations and Theses - Computer Science
Dissertations and Theses from the Dept. of Computer Science