Article · BMC Bioinformatics · May 31, 2022 · Gold OA

Statistical power for cluster analysis

MRC Cognition and Brain Sciences Unit · University of Cambridge

Indexed in: arXiv, Crossref, DataCite, DOAJ, PubMed

Abstract

Background

Cluster algorithms are gaining in popularity in biomedical research due to their compelling ability to identify discrete subgroups in data, and their increasing accessibility in mainstream software. While guidelines exist for algorithm selection and outcome evaluation, there are no firmly established ways of computing a priori statistical power for cluster analysis. Here, we estimated power and classification accuracy for common analysis pipelines through simulation. We systematically varied subgroup size, number, separation (effect size), and covariance structure. We then subjected the generated datasets to dimensionality reduction approaches (none, multi-dimensional scaling, or uniform manifold approximation and projection) and cluster algorithms (k-means; agglomerative hierarchical clustering with Ward or average linkage and Euclidean or cosine distance; and HDBSCAN). Finally, we directly compared the statistical power of discrete (k-means), "fuzzy" (c-means), and finite mixture modelling approaches (which include latent class analysis and latent profile analysis).
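The simulation logic described above can be illustrated with a minimal sketch. This is not the authors' code: it uses only the Python standard library, covers just the two-subgroup, k-means, no-dimensionality-reduction case, and assumes spherical unit-variance subgroups. The function names (`simulate_two_groups`, `kmeans2`, `silhouette`, `estimate_power`) are hypothetical, and counting a simulation as a "detection" when the mean silhouette score reaches 0.5 is an assumed criterion chosen for illustration.

```python
import math
import random


def simulate_two_groups(n_per_group, n_features, delta, rng):
    """Draw two spherical, unit-variance Gaussian subgroups whose
    centroids are `delta` apart (assumption: Euclidean distance in SD units)."""
    # Spread the separation evenly: each feature is shifted by delta / sqrt(p).
    shift = delta / math.sqrt(n_features)
    data = []
    for group in (0, 1):
        for _ in range(n_per_group):
            data.append([rng.gauss(group * shift, 1.0) for _ in range(n_features)])
    return data


def kmeans2(data, rng, n_iter=50):
    """Plain Lloyd's algorithm with k = 2; returns one label per observation."""
    centroids = rng.sample(data, 2)
    labels = None
    for _ in range(n_iter):
        new_labels = [
            0 if math.dist(x, centroids[0]) <= math.dist(x, centroids[1]) else 1
            for x in data
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for k in (0, 1):
            members = [x for x, lab in zip(data, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def silhouette(data, labels):
    """Mean silhouette coefficient for a two-cluster solution."""
    if len(set(labels)) < 2:
        return 0.0
    scores = []
    for i, x in enumerate(data):
        own = [math.dist(x, y) for j, y in enumerate(data)
               if j != i and labels[j] == labels[i]]
        other = [math.dist(x, y) for j, y in enumerate(data)
                 if labels[j] != labels[i]]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(own) / len(own)      # mean distance within own cluster
        b = sum(other) / len(other)  # mean distance to the other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def estimate_power(n_per_group, n_features, delta,
                   n_sims=200, threshold=0.5, seed=1):
    """Proportion of simulated datasets whose k-means solution reaches
    the (assumed) silhouette threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        data = simulate_two_groups(n_per_group, n_features, delta, rng)
        labels = kmeans2(data, rng)
        if silhouette(data, labels) >= threshold:
            hits += 1
    return hits / n_sims
```

Calling `estimate_power(20, 2, 4.0)` and comparing it with `estimate_power(20, 2, 0.0)` gives a rough feel for how the detection rate depends on separation. Reproducing the full design would also require varying the number of subgroups and the covariance structure, and adding the dimensionality reduction and alternative clustering steps.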

Results

We found that clustering outcomes were driven by large effect sizes or the accumulation of many smaller effects across features, and were mostly unaffected by differences in covariance structure. Sufficient statistical power was achieved with relatively small samples (N = 20 per subgroup), provided that cluster separation was large (Δ = 4). Finally, we demonstrated that fuzzy clustering can provide a more parsimonious and powerful alternative for identifying separable multivariate normal distributions, particularly those with slightly lower centroid separation (Δ = 3).
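The remark about many smaller effects accumulating can be made concrete with a short worked example. It rests on an assumption not stated in this abstract: that Δ is the Euclidean distance between subgroup centroids when every feature is uncorrelated and scaled to unit within-group variance, with δ_j denoting the standardized difference on feature j and p the number of features.

```latex
\Delta \;=\; \lVert \mu_1 - \mu_2 \rVert_2 \;=\; \sqrt{\sum_{j=1}^{p} \delta_j^{2}},
\qquad
\delta_j = \delta \ \text{for all } j \;\Rightarrow\; \Delta = \delta\sqrt{p}
% Example: p = 16 features, each with \delta = 1, gives \Delta = 1 \cdot \sqrt{16} = 4.
```

Under that assumption, a separation of Δ = 4 could arise from one feature with a very large effect or from sixteen features that each differ by a single standard deviation.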

Citation impact

Total citations: 525
FWCI: 62.84
Percentile: 100%
References: 59

Authors

3

Topics & keywords

Keywords
  • Cluster analysis
  • Hierarchical clustering
  • Sample size determination
  • Principal component analysis
  • Covariance
  • Dimensionality reduction
  • Computer science
  • Statistical power

Funding