SMOTE for high-dimensional class-imbalanced data

Blagus, Rok; Lusa, Lara

doi:10.1186/1471-2105-14-106

Article | BMC Bioinformatics | Mar 22, 2013 | Gold OA

University of Ljubljana

Indexed in: Crossref, DOAJ, PubMed

Abstract

Background

Classification using class-imbalanced data is biased in favor of the majority class. The bias is even larger for high-dimensional data, where the number of variables greatly exceeds the number of samples. The problem can be attenuated by undersampling or oversampling, both of which produce class-balanced data. Generally, undersampling is helpful, while random oversampling is not. Synthetic Minority Oversampling TEchnique (SMOTE) is a very popular oversampling method that was proposed as an improvement on random oversampling, but its behavior on high-dimensional data has not been thoroughly investigated. In this paper we investigate the properties of SMOTE from a theoretical and an empirical point of view, using simulated and real high-dimensional data.
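For readers unfamiliar with the method, the interpolation step that SMOTE is commonly described as performing can be sketched as follows. This is a minimal illustration written with NumPy, not the authors' implementation: the function name `smote_sketch`, its parameters, and the choice to draw base samples at random are all assumptions made for this example.

```python
import numpy as np


def smote_sketch(X_min, n_synthetic, k=5, seed=0):
    """Create synthetic minority samples by interpolating between a
    minority sample and one of its k nearest minority neighbours.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = X_min.shape[0]
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)  # a sample cannot have more neighbours than n - 1

    # Pairwise Euclidean distances between minority samples
    # (O(n^2 * p) memory, acceptable only for small minority classes).
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour

    # Indices of the k nearest neighbours of every minority sample.
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Assumption: base samples are drawn at random with replacement.
    base = rng.integers(0, n, size=n_synthetic)
    pick = rng.integers(0, k, size=n_synthetic)
    nn = neighbours[base, pick]

    # Place each synthetic point at a uniform random position on the
    # segment joining the base sample and the chosen neighbour.
    u = rng.random((n_synthetic, 1))
    return X_min[base] + u * (X_min[nn] - X_min[base])


# Example: 10 minority samples measured on 1000 variables.
rng = np.random.default_rng(1)
X_min = rng.normal(size=(10, 1000))
X_syn = smote_sketch(X_min, n_synthetic=40)
print(X_syn.shape)  # (40, 1000)
```

Because every synthetic point is a convex combination of two existing minority samples, it never leaves the coordinate-wise range of the original minority data.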

Results

While in most cases SMOTE seems beneficial with low-dimensional data, it does not attenuate the bias towards classification in the majority class for most classifiers when data are high-dimensional, and it is less effective than random undersampling. SMOTE is beneficial for k-NN classifiers on high-dimensional data if the number of variables is first reduced by some type of variable selection; we explain why, otherwise, the k-NN classification is biased towards the minority class. Furthermore, we show that on high-dimensional data SMOTE does not change the class-specific mean values, while it decreases data variability and introduces correlation between samples. We explain how our findings impact class prediction for high-dimensional data.
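The statement about means, variability and correlation can be illustrated with a short derivation for a single variable. This is a sketch under simplifying assumptions that are not part of the abstract: a minority sample X and its selected neighbour X_NN are treated as independent with common mean μ and variance σ², and the interpolation weight U is uniform on (0, 1) and independent of both. The symbols U, μ and σ² are introduced here for illustration only.

```latex
% Illustrative assumptions: X and X_NN independent, common mean \mu and
% variance \sigma^2; U ~ Uniform(0,1), independent of X and X_NN.
\begin{align*}
X_{\mathrm{SMOTE}} &= X + U\,(X_{\mathrm{NN}} - X) = (1-U)\,X + U\,X_{\mathrm{NN}},\\
\mathrm{E}[X_{\mathrm{SMOTE}}] &= \mathrm{E}[1-U]\,\mu + \mathrm{E}[U]\,\mu = \mu,\\
\mathrm{Var}[X_{\mathrm{SMOTE}}] &= \bigl(\mathrm{E}[(1-U)^2] + \mathrm{E}[U^2]\bigr)\,\sigma^2
  = \bigl(\tfrac{1}{3} + \tfrac{1}{3}\bigr)\,\sigma^2 = \tfrac{2}{3}\,\sigma^2,\\
\mathrm{Cov}[X_{\mathrm{SMOTE}},\,X] &= \mathrm{E}[1-U]\,\sigma^2 = \tfrac{1}{2}\,\sigma^2 .
\end{align*}
```

Under these assumptions the synthetic sample keeps the class mean, has smaller variance than the original samples, and is positively correlated with the sample it was generated from. In real data the nearest neighbour is not exactly independent of the base sample, so the constants should be read as approximations.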

Citation impact

Total citations: 1,078

FWCI: 16.80
Percentile: 100%
References: 47

Authors: 2

Topics & keywords

Keywords

Undersampling
Oversampling
Random forest
Computer science
Class (philosophy)
Artificial intelligence
Clustering high-dimensional data
Machine learning
