Statistical power for cluster analysis

Dalmaijer, Edwin S.; Nord, Camilla L.; Astle, Duncan E.

doi:10.1186/s12859-022-04675-1

Article · BMC Bioinformatics · May 31, 2022 · Gold OA

MRC Cognition and Brain Sciences Unit · University of Cambridge

Indexed in: arXiv, Crossref, DataCite, DOAJ, PubMed

Abstract

Background

Cluster algorithms are gaining in popularity in biomedical research due to their compelling ability to identify discrete subgroups in data, and their increasing accessibility in mainstream software. While guidelines exist for algorithm selection and outcome evaluation, there are no firmly established ways of computing a priori statistical power for cluster analysis. Here, we estimated power and classification accuracy for common analysis pipelines through simulation. We systematically varied subgroup size, number, separation (effect size), and covariance structure. We then subjected generated datasets to dimensionality reduction approaches (none, multi-dimensional scaling, or uniform manifold approximation and projection) and cluster algorithms (k-means, agglomerative hierarchical clustering with Ward or average linkage and Euclidean or cosine distance, HDBSCAN). Finally, we directly compared the statistical power of discrete (k-means), "fuzzy" (c-means), and finite mixture modelling approaches (which include latent class analysis and latent profile analysis).
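The simulation approach described above can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function names `simulate_two_groups` and `estimate_power`, the use of scikit-learn, the silhouette-score criterion with a 0.5 threshold, and the restriction to two equally sized spherical subgroups are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def simulate_two_groups(n_per_group, delta, n_features, rng):
    """Draw two spherical multivariate normal subgroups (unit variance)
    whose centroids lie `delta` apart in Euclidean distance."""
    shift = np.zeros(n_features)
    shift[0] = delta  # assumption: all separation sits on the first feature
    group_a = rng.standard_normal((n_per_group, n_features))
    group_b = rng.standard_normal((n_per_group, n_features)) + shift
    return np.vstack([group_a, group_b])


def estimate_power(n_per_group, delta, n_features=2, n_sims=100,
                   threshold=0.5, seed=0):
    """Fraction of simulated datasets in which k-means (k=2) yields a
    silhouette score at or above `threshold` (an assumed criterion)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        X = simulate_two_groups(n_per_group, delta, n_features, rng)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
        if silhouette_score(X, labels) >= threshold:
            hits += 1
    return hits / n_sims
```

Calling `estimate_power(n_per_group=20, delta=4.0)` returns the proportion of simulated datasets that pass the assumed criterion. Repeating the call over a grid of `n_per_group` and `delta` values gives a rough power curve for this one pipeline.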

Results

We found that clustering outcomes were driven by large effect sizes or the accumulation of many smaller effects across features, and were mostly unaffected by differences in covariance structure. Sufficient statistical power was achieved with relatively small samples (N = 20 per subgroup), provided cluster separation is large (Δ = 4). Finally, we demonstrated that fuzzy clustering can provide a more parsimonious and powerful alternative for identifying separable multivariate normal distributions, particularly those with slightly lower centroid separation (Δ = 3).
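To make the fuzzy clustering result more concrete, the sketch below implements a basic fuzzy c-means in NumPy, in which every observation receives a graded membership in each cluster rather than a single hard label. It is an illustrative assumption rather than the implementation used in the paper: the names `fuzzy_c_means` and `hard_label_accuracy`, the fuzzifier `m = 2`, and the initialisation from randomly chosen data points are all hypothetical choices.

```python
import numpy as np


def fuzzy_c_means(X, n_clusters=2, m=2.0, n_iter=200, tol=1e-6, seed=0):
    """Basic fuzzy c-means. Returns (centroids, memberships), where each
    row of memberships sums to 1. `m` > 1 is the fuzzifier (assumed 2)."""
    rng = np.random.default_rng(seed)
    # Assumption: start from randomly chosen observations as centroids.
    centroids = X[rng.choice(X.shape[0], n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        # Distance of every observation to every centroid.
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)  # avoid division by zero
        # Membership update: u_ik proportional to d_ik^(-2 / (m - 1)).
        inv = dist ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)
        # Centroid update: mean of the data weighted by u^m.
        w = u ** m
        new_centroids = (w.T @ X) / w.sum(axis=0)[:, None]
        shift = np.abs(new_centroids - centroids).max()
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, u


def hard_label_accuracy(u, truth):
    """Accuracy of argmax memberships for two clusters, ignoring which
    cluster index was assigned to which true subgroup."""
    match = np.mean(u.argmax(axis=1) == truth)
    return max(match, 1.0 - match)
```

One could, for instance, generate two subgroups of 20 observations with centroid separation Δ = 3, pass them to `fuzzy_c_means`, and inspect the membership matrix alongside `hard_label_accuracy` to see how confidently each observation is assigned.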

Keywords

Cluster analysis
Hierarchical clustering
Sample size determination
Principal component analysis
Covariance
Dimensionality reduction
Computer science
Statistical power
