Extracting structured data from Web pages

Arasu, Arvind; García-Molina, Héctor

doi:10.1145/872757.872799

Article · Jun 9, 2003 · Closed access

Arvind Arasu · Héctor García-Molina

Palo Alto University · Stanford University

Indexed in Crossref

Abstract

Many web sites contain large sets of pages generated using a common template or layout. For example, Amazon lays out the author, title, comments, etc. in the same way in all its book pages. The values used to generate the pages (e.g., the author, title,...) typically come from a database. In this paper, we study the problem of automatically extracting the database values from such template-generated web pages without any learning examples or other similar human input. We formally define a template, and propose a model that describes how values are encoded into pages using a template. We present an algorithm that takes, as input, a set of template-generated pages, deduces the unknown template used to generate…

Citation impact

Total citations: 701

FWCI: 112.43
Percentile: 100%
References: 24

Authors: 2

Topics & keywords

Keywords

Computer science
Web page
Information retrieval
Set (abstract data type)
HITS algorithm
Static web page
Data mining
World Wide Web

Funding

National Science Foundation