WebTables

Cafarella, Michael; Halevy, Alon; Wang, Daisy Zhe; Wu, Eugene; Zhang, Yang

doi:10.14778/1453856.1453916

Article · Proceedings of the VLDB Endowment · Aug 1, 2008 · Closed access

University of Washington · Google (United States) · +4 more institutions

Indexed in Crossref

Abstract

The World-Wide Web consists of a huge number of unstructured documents, but it also contains structured data in the form of HTML tables. We extracted 14.1 billion HTML tables from Google's general-purpose web crawl, and used statistical classification techniques to find the estimated 154M that contain high-quality relational data. Because each relational table has its own "schema" of labeled and typed columns, each such table can be considered a small structured database. The resulting corpus of databases is larger than any other corpus we are aware of, by at least five orders of magnitude. We describe the WEBTABLES system to explore two fundamental questions about this collection of databases. First, what are…

Citation impact

Total citations: 634

FWCI: 110.28
Percentile: 100%
References: 43

Authors: 5

Topics & keywords

Keywords

Computer science
Information retrieval
Information schema
Schema matching
Schema (genetic algorithms)
Database schema
Data mining
Semi-structured model