Truth finding in databases

Zhao, Bo

Permalink

https://hdl.handle.net/2142/42470

Description

Title

Truth finding in databases

Author(s)

Zhao, Bo

Issue Date

2013-02-03T19:46:38Z

Director of Research (if dissertation) or Advisor (if thesis)

Han, Jiawei

Doctoral Committee Chair(s)

Han, Jiawei

Committee Member(s)

Zhai, ChengXiang
Roth, Dan
Yu, Philip S.

Department of Study

Computer Science

Discipline

Computer Science

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Keyword(s)

data integration
truth finding
data fusion
data quality
entity matching
data mining
probabilistic graphical models

Abstract

In practical data integration systems, the sources being integrated often provide conflicting information about the same entity. Consequently, a major challenge for data integration is to derive the most complete and accurate integrated records from diverse and sometimes conflicting sources. We term this challenge the truth finding problem. We observe that some sources are generally more reliable than others, and therefore a good model of source quality is the key to solving the truth finding problem. In this thesis, we propose probabilistic models that automatically infer true records and source quality, without any supervision, on both categorical and numerical data. We further develop a new entity matching framework that takes source quality into account by building on truth-finding models.

On categorical data, in contrast to previous methods, our principled approach models a generative process with two types of errors (false positives and false negatives), each corresponding to a different aspect of source quality. In so doing, ours is also the first approach designed to merge multi-valued attribute types. Our method is scalable, thanks to an efficient sampling-based inference algorithm that needs very few iterations in practice and enjoys linear time complexity, with an even faster incremental variant. Experiments on two real-world datasets show that our new method outperforms existing state-of-the-art approaches to the truth finding problem on categorical data.

In practice, numerical data is not only ubiquitous but also of high value, e.g. prices, weather, census data, polls, and economic statistics. Because of its characteristics, quality issues on numerical data can be even more common and severe than on categorical data. Therefore, in this thesis we propose a new truth-finding method specially designed for numerical data. Based on Bayesian probabilistic models, our method leverages the characteristics of numerical data in a principled way when modeling the dependencies among source quality, truth, and claimed values. Experiments on two real-world datasets show that our new method outperforms existing state-of-the-art approaches in both effectiveness and efficiency.

We further observe that modeling source quality can help not only to decide the truth but also to match entities across different sources. Therefore, as a natural next step, we integrate truth finding with entity matching so that we can infer the matching of entities, the true attributes of entities, and source quality jointly. This is the first entity matching approach that involves modeling source quality and truth finding. Experiments show that our approach can outperform state-of-the-art baselines.
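To make the categorical setting concrete, the following is a minimal sketch of truth finding with two-sided source quality. It is not the thesis's model or its sampling-based inference: it swaps in a simple expectation-maximization style iteration over a naive Bayes model, and the function name `find_truth`, the input format, the uniform prior, and the smoothing constant are all assumptions made for illustration. Each source gets a sensitivity (how often it asserts true facts, so low sensitivity means many false negatives) and a specificity (how often it stays silent on false facts, so low specificity means many false positives).

```python
from collections import defaultdict


def find_truth(claims, n_iter=20, prior=0.5, smooth=1.0):
    """Estimate fact probabilities and two-sided source quality.

    claims: {source: {entity: set_of_asserted_values}} (hypothetical format).
    Returns (p_true, sensitivity, specificity).
    """
    # Candidate facts are all (entity, value) pairs asserted by any source.
    facts = sorted({(e, v) for ents in claims.values()
                    for e, vals in ents.items() for v in vals})
    # Assumption of this sketch: a source that covers an entity but omits
    # a value makes an implicit negative claim about that fact.
    obs = {f: [(s, f[1] in ents[f[0]])
               for s, ents in claims.items() if f[0] in ents]
           for f in facts}
    # Start from the fraction of covering sources that assert each fact.
    p_true = {f: sum(a for _, a in o) / len(o) for f, o in obs.items()}

    sens, spec = {}, {}
    for _ in range(n_iter):
        # Re-estimate source quality from the current soft truth labels.
        tp, fp = defaultdict(float), defaultdict(float)
        fn, tn = defaultdict(float), defaultdict(float)
        for f, o in obs.items():
            for s, asserted in o:
                if asserted:
                    tp[s] += p_true[f]
                    fp[s] += 1.0 - p_true[f]
                else:
                    fn[s] += p_true[f]
                    tn[s] += 1.0 - p_true[f]
        # Additive smoothing keeps both rates strictly between 0 and 1.
        sens = {s: (tp[s] + smooth) / (tp[s] + fn[s] + 2 * smooth)
                for s in claims}
        spec = {s: (tn[s] + smooth) / (tn[s] + fp[s] + 2 * smooth)
                for s in claims}

        # Re-estimate each fact's truth probability with Bayes' rule.
        for f, o in obs.items():
            like_true, like_false = prior, 1.0 - prior
            for s, asserted in o:
                like_true *= sens[s] if asserted else 1.0 - sens[s]
                like_false *= 1.0 - spec[s] if asserted else spec[s]
            p_true[f] = like_true / (like_true + like_false)
    return p_true, sens, spec


# Toy multi-valued attribute: sources A and B agree, while source C
# repeats their values and adds extra ones that nobody else supports.
claims = {
    "A": {"m1": {"x", "y"}, "m2": {"z"}},
    "B": {"m1": {"x", "y"}, "m2": {"z"}},
    "C": {"m1": {"x", "y", "w"}, "m2": {"z", "q"}},
}
p_true, sens, spec = find_truth(claims)
for fact in sorted(p_true):
    print(fact, round(p_true[fact], 3))
```

Keeping sensitivity and specificity separate lets a source like C remain useful for the values it shares with others while its unsupported extra values are discounted, which a single accuracy score per source cannot express. Because a fact is an (entity, value) pair, several values of the same attribute can be judged true at once.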

Graduation Semester

2012-12

Owning Collections

Dissertations and Theses - Computer Science

Dissertations and Theses from the Dept. of Computer Science

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois
