Clustering by Compression
Data61 · Amsterdam University of the Arts · +1 more institution
Abstract
We present a new method for clustering based on compression. The method does not use subject-specific features or background knowledge, and works as follows: First, we determine a parameter-free, universal, similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal. However, the optimality comes at the price of using the noncomputable notion of…
Citation impact
- FWCI
- 82.68
- Percentile
- 100%
- References
- 49
Authors
2Topics & keywords
- Kolmogorov complexity
- Cluster analysis
- Computer science
- Distance matrix
- Hierarchical clustering
- Data compression
- Pairwise comparison
- Theoretical computer science
- Quality Education