articleIEEE Transactions on Information TheoryApr 1, 2005Closed access

Clustering by Compression

Data61 · Amsterdam University of the Arts · +1 more institution

Indexed incrossref

Abstract

We present a new method for clustering based on compression. The method does not use subject-specific features or background knowledge, and works as follows: First, we determine a parameter-free, universal, similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal. However, the optimality comes at the price of using the noncomputable notion of…

Citation impact

1,190
total citations
FWCI
82.68
Percentile
100%
References
49
Citations per year

Authors

2

Topics & keywords

Keywords
  • Kolmogorov complexity
  • Cluster analysis
  • Computer science
  • Distance matrix
  • Hierarchical clustering
  • Data compression
  • Pairwise comparison
  • Theoretical computer science
UN Sustainable Development Goals
  • Quality Education
No related works found for this paper.

Funding