Making Holistic Schema Matching Robust: An Ensemble Framework with Sampling and Voting
He, Bin; Chang, Kevin Chen-Chuan
Permalink
https://hdl.handle.net/2142/10881
Description
- Title
- Making Holistic Schema Matching Robust: An Ensemble Framework with Sampling and Voting
- Author(s)
- He, Bin
- Chang, Kevin Chen-Chuan
- Issue Date
- 2004-07
- Keyword(s)
- Database Web mining
- Abstract
- With the prevalence of databases on the Web, large-scale integration has become a pressing problem. As an essential task, holistic schema matching (i.e., discovering attribute correspondences among many schemas) has been actively studied recently. As a "data mining" approach in nature, holistic schema matching benefits from the large scale of input schema data on the one hand, but suffers from noise on the other. Such noise often inevitably arises in the automatic extraction of schema data, which is mandatory in large-scale integration. For holistic matching to be viable, it is thus essential to make it robust against noisy schemas. Toward this goal, we propose a novel "ensemble" framework, which aggregates a multitude of base holistic matchers to achieve robustness by exploiting statistical sampling and majority voting. To begin with, we observe that Web query interfaces possess two interesting characteristics: 1) "redundancy of attributes": schemas tend to share attributes, and 2) "non-dominance of noise": noisy schemas are relatively few. These observations inspire us to develop a generic ensemble framework, which consists of multiple sampling, ranking aggregation, and matching selection. In essence, our approach creates an ensemble of base holistic matchers by randomizing the schema data into many trials and aggregating their ranked results by majority voting. We provide analytic justification of the robustness of the ensemble. Empirically, our experiments show that the "ensemblization" indeed significantly boosts the matching accuracy over automatically extracted schema data.
- Type of Resource
- text
- Copyright and License Information
- You are granted permission for the non-commercial reproduction, distribution, display, and performance of this technical report in any format, BUT this permission is only for a period of 45 (forty-five) days from the most recent time that you verified that this technical report is still available from the University of Illinois at Urbana-Champaign Computer Science Department under terms that include this permission. All other rights are reserved by the author(s).
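
Illustrative sketch (not taken from the report): the abstract describes an ensemble built from multiple sampling, aggregation of the trial results, and majority voting. The Python below is a minimal toy version of that idea under loudly stated assumptions. The names toy_base_matcher and ensemble_match, the co-occurrence scoring rule, the parameters trials and sample_size, and the example schemas are all hypothetical and are not the authors' algorithm. In particular, the report aggregates ranked results, while this sketch simplifies that step to counting how many trials propose each matching.

```python
import random
from collections import Counter
from itertools import combinations


def toy_base_matcher(schemas, top_k=3):
    """Hypothetical stand-in for a base holistic matcher.

    Proposes attribute pairs that never co-occur in the same schema,
    ranked by the smaller of the two attribute frequencies.
    """
    freq = Counter(a for s in schemas for a in set(s))
    cooccur = Counter()
    for s in schemas:
        for pair in combinations(sorted(set(s)), 2):
            cooccur[pair] += 1
    candidates = [
        (min(freq[a], freq[b]), (a, b))
        for a, b in combinations(sorted(freq), 2)
        if cooccur[(a, b)] == 0
    ]
    candidates.sort(key=lambda c: (-c[0], c[1]))
    return [pair for _, pair in candidates[:top_k]]


def ensemble_match(schemas, base_matcher, trials=25, sample_size=None, seed=0):
    """Run the base matcher on many random samples and keep majority winners."""
    rng = random.Random(seed)
    if sample_size is None:
        # Assumed default: small samples, so most trials miss the few noisy schemas.
        sample_size = max(1, len(schemas) // 4)
    votes = Counter()
    for _ in range(trials):
        sample = rng.sample(schemas, sample_size)  # sampling without replacement
        votes.update(base_matcher(sample))         # one vote per proposed pair
    # Simplified selection: keep pairs proposed in more than half of the trials.
    return sorted(pair for pair, n in votes.items() if n > trials / 2)


# Toy data (made up): 19 clean schemas plus 1 noisy schema that wrongly
# lists "author" and "writer" together.
schemas = ([["author", "title"]] * 10 + [["writer", "title"]] * 9
           + [["author", "writer", "title"]])
single = toy_base_matcher(schemas)
robust = ensemble_match(schemas, toy_base_matcher, trials=200, sample_size=4)
print("single matcher:", single)
print("ensemble:", robust)
```

On this toy data the single matcher proposes nothing, because the one noisy schema makes "author" and "writer" co-occur. Most small samples exclude that schema, so the ensemble can still recover the pair by majority vote. This mirrors the two observations in the abstract: attributes are redundant across schemas, and noisy schemas are relatively few.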