Article | Nov 23, 2002 | Closed access
On the resemblance and containment of documents
Digital Science (United States)
Indexed in Crossref
Abstract
Given two documents A and B we define two mathematical notions: their resemblance r(A, B) and their containment c(A, B) that seem to capture well the informal notions of "roughly the same" and "roughly contained." The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that can be done independently for each document. Furthermore, the resemblance can be evaluated using a fixed size sample for each document. This paper discusses the mathematical properties of these measures and the efficient implementation of the sampling process using Rabin (1981) fingerprints.
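The reduction to set intersection described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation. It treats each document as a set of contiguous w-word sequences (shingles), takes resemblance as the size of the intersection over the size of the union, and takes containment as the size of the intersection over the size of the first document's set. The shingle width `w=4`, the sample size `s=100`, and all function names are hypothetical choices made for the example. A truncated SHA-1 digest stands in for the Rabin fingerprints named in the abstract.

```python
import hashlib


def shingles(text, w=4):
    """Return the set of contiguous w-word sequences in text (w is an assumed value)."""
    words = text.lower().split()
    if len(words) < w:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Exact resemblance: |S(A) & S(B)| / |S(A) | S(B)|."""
    sa, sb = shingles(a, w), shingles(b, w)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def containment(a, b, w=4):
    """Exact containment of A in B: |S(A) & S(B)| / |S(A)|."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa) if sa else 1.0


def fingerprint(shingle):
    """64-bit integer for a shingle. SHA-1 is a stand-in here, NOT a Rabin fingerprint."""
    digest = hashlib.sha1(" ".join(shingle).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(text, s=100, w=4):
    """Fixed-size sample: the s smallest fingerprints, computed per document."""
    return sorted({fingerprint(sh) for sh in shingles(text, w)})[:s]


def estimate_resemblance(sketch_a, sketch_b, s=100):
    """Estimate resemblance from two sketches alone, without the original documents."""
    set_a, set_b = set(sketch_a), set(sketch_b)
    smallest = set(sorted(set_a | set_b)[:s])  # s smallest values of the combined sample
    if not smallest:
        return 1.0
    return len(smallest & set_a & set_b) / len(smallest)
```

For example, "the quick brown fox jumps over the lazy dog" and the same sentence ending in "cat" each yield six 4-word shingles, five of which are shared, so `resemblance` returns 5/7 and `containment` returns 5/6. Because each sketch depends on a single document, sketches can be computed once and stored, and pairs can be compared later from the stored samples.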
Citation impact
- Total citations: 1,700
- FWCI: 40.50
- Percentile: 100%
- References: 11
Authors: 1
Topics & keywords
Keywords
- Containment (computer programming)
- Intersection (aeronautics)
- Computer science
- Process (computing)
- Sampling (signal processing)
- Set (abstract data type)
- Sample (material)
- Theoretical computer science