Cross-lingual entity extraction and linking for 300 languages

Pan, Xiaoman

Permalink

https://hdl.handle.net/2142/109431

Description

Title

Cross-lingual entity extraction and linking for 300 languages

Author(s)

Pan, Xiaoman

Issue Date

2020-12-03

Director of Research (if dissertation) or Advisor (if thesis)

Ji, Heng

Doctoral Committee Chair(s)

Ji, Heng

Committee Member(s)

Han, Jiawei
Tong, Hanghang
Knight, Kevin

Department of Study

Computer Science

Discipline

Computer Science

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Keyword(s)

cross-lingual
entity extraction
entity linking

Abstract

Information provided in languages that people can understand saves lives in crises. For example, the language barrier was one of the main difficulties faced by humanitarian workers responding to the Ebola crisis in 2014. We propose to break language barriers by extracting information (e.g., entities) from a massive variety of languages and grounding it in an existing Knowledge Base (KB) that is accessible to users in their own language (e.g., a reporter from the World Health Organization who speaks only English). The ambitious goal of this thesis is to develop a Cross-lingual Entity Extraction and Linking framework for 1,000 fine-grained entity types and the 300 languages that exist in Wikipedia. Given a document in any of these languages, our framework identifies entity name mentions, assigns a fine-grained type to each mention, and links it to an English KB if it is linkable. Traditional entity linking methods rely on costly human-annotated data to train supervised learning-to-rank models that select the best candidate entity for each mention. In contrast, we propose a novel unsupervised represent-and-compare approach that can accurately capture the semantic meaning of each mention and directly compare its representation with that of each candidate entity in the target KB. First, we leverage the deep symbolic semantic representation of Abstract Meaning Representation (AMR) to represent the contextual properties of mentions. Then we enrich the representation of each contextual word and entity mention with a novel distributed semantic representation based on cross-lingual joint entity and word embedding. We develop a novel method to generate cross-lingual data that mixes entities and contextual words based on Wikipedia. This distributed semantics enables effective entity extraction and linking.
Because the joint entity and word embedding space is constructed across languages, we further extend it to all 300 Wikipedia languages and to fine-grained entity extraction and linking for 1,000 entity types defined in YAGO. Finally, using knowledge-driven question answering as a case study, we demonstrate the effectiveness of using entity extraction and linking to acquire external knowledge that improves downstream applications.
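As an illustration only, the represent-and-compare idea described in the abstract can be sketched as follows. The abstract does not state the comparison function or how unlinkable mentions are detected, so the cosine similarity, the `nil_threshold` value, and the names `cosine` and `link_mention` are all assumptions made for this sketch, not the dissertation's actual method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def link_mention(mention_vec, candidates, nil_threshold=0.5):
    """Compare a mention vector with each candidate entity vector.

    candidates maps a KB identifier to its vector. Returns (kb_id, score)
    for the most similar candidate, or (None, score) when no candidate
    reaches nil_threshold, i.e. the mention is treated as unlinkable.
    The threshold is a hypothetical parameter chosen for illustration.
    """
    best_id, best_score = None, -1.0
    for kb_id, entity_vec in candidates.items():
        score = cosine(mention_vec, entity_vec)
        if score > best_score:
            best_id, best_score = kb_id, score
    if best_score < nil_threshold:
        return None, best_score
    return best_id, best_score
```

In this sketch no ranking model is trained: the mention and the candidates only need to live in the same vector space, which is the role the abstract assigns to the cross-lingual joint entity and word embedding.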

Graduation Semester

2020-12

Type of Resource

Thesis

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Computer Science

Dissertations and Theses from the Dept. of Computer Science
