A Structure-Driven Yield-Aware Web Form Crawler: Building a Database of Online Databases

He, Bin; Li, Chengkai; Killian, David; Patel, Mitesh; Tseng, Yuping; Chang, Kevin Chen-Chuan

Permalink

https://hdl.handle.net/2142/11235

Description

Title

A Structure-Driven Yield-Aware Web Form Crawler: Building a Database of Online Databases

Author(s)

He, Bin
Li, Chengkai
Killian, David
Patel, Mitesh
Tseng, Yuping
Chang, Kevin Chen-Chuan

Issue Date

2006-07

Keyword(s)

database

Abstract

"The Web has been rapidly ``deepened"" by massive databases online: Recent surveys show that while the surface Web has linked billions of static HTML pages, a far more significant amount of information is ``hidden"" in the deep Web, behind the query forms of searchable databases. With its myriad databases and hidden content, this deep Web is an important frontier for information search. In this paper, we develop a novel Web Form Crawler to collect the ``doors"" of Web databases, i.e., query forms, to build a database for online databases in both efficient and comprehensive manners. Being object-focused, topic-neutral and coverage-comprehensive, such a crawler, while critical to searching and integrating online databases, has not been extensively studied. In particular, query forms, while many, when compared with the size of the Web, are sparsely scattered among pages, which brings new challenges for focused crawling: First, due to the topic-neutral nature of our crawling problem, we cannot rely on existing topic-focused crawling techniques. Second, existing focused crawling cannot achieve the comprehensiveness requirement because it is not able to be aware of the coverage of crawled content. As a new attempt, we propose a structure-driven crawling framework by observing structure locality of query forms-- That is, query forms are often close to root pages of Web sites and accessible by following navigational links. Exploring this structure locality, we substantiate the structure-driven crawling framework into a site-based Web Form Crawler by first collecting the site entrances, as the Site Finder, and then searching for query forms within the scope of each site, as the Form Finder. 
Analytical justification and empirical evaluation of the Web Form Crawler both show that: 1) our crawler can maintain stable harvest and coverage throughout the crawling, and 2) compared to page-based crawling, our best harvest rate is about 10 to 400 times better, depending on the page traversal schemes used."
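
The site-based idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It covers only a Form Finder-like step, assumes the site entrance is already known (the Site Finder step is omitted), and uses a breadth-first, depth-limited traversal as one possible page traversal scheme. The names `PageScanner`, `find_forms`, `max_depth`, and `TOY_SITE` are hypothetical, the pages are an in-memory toy site rather than real Web pages, and any `<form>` tag is counted without checking whether it is a query form. The harvest rate is taken here to mean pages with forms per page fetched, which is an assumption; the abstract does not define it.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageScanner(HTMLParser):
    """Collects link targets and counts <form> elements on one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.form_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "form":
            self.form_count += 1


def find_forms(root_url, fetch, max_depth=3):
    """Breadth-first search for pages containing forms, within one site.

    `fetch` maps a URL to its HTML text, or None if the page is unavailable.
    Returns (pages_with_forms, pages_fetched).
    """
    site = urlparse(root_url).netloc
    seen = {root_url}
    queue = deque([(root_url, 0)])
    found, fetched = [], 0
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        fetched += 1
        scanner = PageScanner()
        scanner.feed(html)
        if scanner.form_count:
            found.append(url)
        if depth == max_depth:
            continue  # structure locality: do not wander far from the root
        for href in scanner.links:
            target = urljoin(url, href)
            # Follow only links that stay inside the current site.
            if urlparse(target).netloc == site and target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return found, fetched


# Hypothetical toy site: the form sits one navigational link below the root.
TOY_SITE = {
    "http://books.example/": (
        '<a href="/search">Search</a> <a href="/about">About</a> '
        '<a href="http://other.example/">Partner</a>'
    ),
    "http://books.example/search": '<form action="/q"><input name="title"></form>',
    "http://books.example/about": '<a href="/">Home</a>',
}

forms, fetched = find_forms("http://books.example/", TOY_SITE.get)
harvest_rate = len(forms) / fetched  # assumed definition, see the note above
print(forms, fetched, harvest_rate)
```

In this toy run the off-site link to `other.example` is never followed, so the traversal stays within the scope of the site. A small `max_depth` trades some coverage for a higher harvest rate, which is the kind of trade-off a yield-aware crawler would need to weigh.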

Type of Resource

text

Copyright and License Information

You are granted permission for the non-commercial reproduction, distribution, display, and performance of this technical report in any format, BUT this permission is only for a period of 45 (forty-five) days from the most recent time that you verified that this technical report is still available from the University of Illinois at Urbana-Champaign Computer Science Department under terms that include this permission. All other rights are reserved by the author(s).

Owning Collections

Research and Tech Reports - Computer Science PRIMARY
