Learning from multiple heterogeneous sources - Handling source trustworthiness and incompleteness

Zhi, Shi

Permalink

https://hdl.handle.net/2142/109527

Description

Title

Learning from multiple heterogeneous sources - Handling source trustworthiness and incompleteness

Author(s)

Zhi, Shi

Issue Date

2020-12-04

Director of Research (if dissertation) or Advisor (if thesis)

Han, Jiawei

Doctoral Committee Chair(s)

Han, Jiawei

Committee Member(s)

Zhai, Chengxiang
Peng, Jian
Tang, Jiliang

Department of Study

Computer Science

Discipline

Computer Science

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Keyword(s)

Truth Discovery
Complementary Learning

Abstract

We live in a world where the volume of data is growing explosively. A single application usually has more than one information source available. Making use of multiple sources can improve the quality of the labels and extend the size of the training data. However, these information sources have their own properties and cannot be directly combined and utilized. We study source heterogeneity from two major aspects: (1) heterogeneous quality and (2) heterogeneous label spaces.

First, when integrating information from multiple sources, it is common to encounter conflicting answers to the same question. Truth discovery aims to infer the most accurate and complete integrated answers from conflicting sources with heterogeneous source quality. In some cases, there exist questions for which the true answers are excluded from the candidate answers provided by all sources. Without any prior knowledge, these questions, named no-truth questions, are difficult to distinguish from the questions that have true answers, named has-truth questions. In particular, these no-truth questions degrade the precision of the answer integration system. We address this challenge by introducing source quality, which is made up of three fine-grained measures: silent rate, false spoken rate, and true spoken rate. By incorporating these three measures, we propose a probabilistic graphical model that simultaneously infers truth and source quality without any a priori training involving ground truth answers. Moreover, since inference in this graphical model requires tuning the prior of truth, we propose an initialization scheme based upon a quantity named truth existence score, which synthesizes two indicators, namely participation rate and consistency rate. Compared with existing methods, our method can effectively filter out no-truth questions, which results in more accurate source quality estimation. Consequently, our method provides more accurate and complete answers to both has-truth and no-truth questions. Experiments on three real-world datasets illustrate the notable advantage of our method over existing state-of-the-art truth discovery methods.

Moreover, we study the truth discovery problem in the truth evolution setting. In many real-life scenarios, the latent true value changes over time instead of staying static. We study this dynamic truth discovery problem for numerical data. Existing models cannot address it because of two new challenges: capturing time-evolving source dependency in a continuous space and handling missing data on the fly. We propose a model named EvolvT for dynamic truth discovery on numerical data. Within the hidden Markov framework, EvolvT captures three key aspects of dynamic truth discovery in a unified model: truth transition regularity, source quality, and source dependency. The most distinctive feature of the modeling part of EvolvT is that it employs Kalman filtering to model truth evolution. As such, EvolvT can infer source dependency in a continuous space in a principled way and can also handle missing data naturally. We establish an expectation-maximization (EM) algorithm for parameter inference of EvolvT and present an efficient online version of the parameter inference procedure. Our experiments on real-world applications demonstrate its advantages over state-of-the-art truth discovery approaches.

We study the second aspect, heterogeneous label spaces, in the sequence labeling task. We aim to train a unified Named Entity Recognition (NER) model with annotations from multiple sources. Even for datasets from the same domain, annotations from different sources cover different sets of entity types. Because of this inconsistency, such datasets are commonly treated as different tasks, and no existing method, to the best of our knowledge, can construct a single model to extract entities of any type covered by disparate datasets. Here, we refer to such tasks as proto-NER and present complementary learning to train with only partial annotations. For this purpose, we explore not only heuristic but also end-to-end learning approaches. Specifically, we transform the original one-hot labels into fuzzy labels while preserving the original information. We further propose the fuzzy conditional random field, which takes fuzzy labels as supervision and automatically integrates the label spaces of different corpora. Extensive experiments demonstrate the efficacy of complementary learning and the superiority of the proposed end-to-end approach.

Although this addresses the issue of incompleteness, inaccuracy in the original labels remains a challenge. To further investigate the heterogeneous label spaces problem, we propose a new framework called cross self-training, which utilizes label predictions for the missing labels together with the partial labels under the complementary learning setting. We propose two variants, iterative weighted combination and sliding window weighted combination, to train the model and update the labels in an alternating manner. Using real-world data, we find that when the original labels of the training datasets are sufficiently accurate, we can improve the overall accuracy of the trained model.
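
Illustrative sketch (not part of the dissertation record): the abstract states that EvolvT uses Kalman filtering to model truth evolution and that this handles missing data naturally. The Python sketch below shows only that general idea with a textbook scalar Kalman filter. It is not EvolvT: the function name `kalman_fuse`, the random-walk truth model, the known per-source noise variances, and the assumption that sources are independent are all simplifications introduced here, whereas the abstract says EvolvT also models source dependency and learns its parameters with EM.

```python
def kalman_fuse(claims, source_var, process_var, init_mean=0.0, init_var=1e6):
    """Track a scalar truth that follows a random walk (hypothetical helper).

    claims: list of time steps; each step is a dict {source: claimed value}.
            A source that is silent at a step is simply absent from the dict.
    source_var: dict {source: observation noise variance}, assumed known here.
    process_var: variance of the random-walk step taken by the truth.
    Returns a list of (mean, variance) estimates, one per time step.
    """
    mean, var = init_mean, init_var
    estimates = []
    for step in claims:
        # Predict: the truth may have drifted, so uncertainty grows.
        var += process_var
        # Update: fold in each available claim, one at a time. With
        # independent source noise this equals a joint update.
        for source, value in step.items():
            gain = var / (var + source_var[source])
            mean += gain * (value - mean)
            var *= 1.0 - gain
        # A step with no claims leaves only the prediction in place.
        estimates.append((mean, var))
    return estimates


claims = [{"a": 10.0, "b": 12.0}, {}, {"a": 11.0}]
estimates = kalman_fuse(claims, {"a": 1.0, "b": 4.0}, process_var=0.5)
for t, (m, v) in enumerate(estimates):
    print(f"t={t}: mean={m:.3f}, var={v:.3f}")
```

In this sketch a lower noise variance plays the role of higher source quality: the claim from source `a` pulls the estimate more strongly than the claim from source `b`. At the step where no source reports, the estimate is carried forward and only its variance increases, which is the sense in which missing data needs no special handling.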

Graduation Semester

2020-12

Type of Resource

Thesis

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Computer Science

Dissertations and Theses from the Dept. of Computer Science