Data oversampling and imbalanced datasets: an investigation of performance for machine learning and feature engineering

Mujahid, Muhammad; Kına, Erol; Rustam, Furqan; Villar, Mónica Gracia; Alvarado, Eduardo Silva; Díez, Isabel de la Torre; Ashraf, Imran

doi:10.1186/s40537-024-00943-4

Article · Journal of Big Data · Jun 17, 2024 · Gold OA

Prince Sultan University · Van Yüzüncü Yıl Üniversitesi · +9 more institutions

Indexed in: Crossref, DOAJ

Abstract

The classification of imbalanced datasets is a prominent task in text mining and machine learning. In such datasets, the samples are not uniformly distributed across classes; one class contains a large number of samples while the other has only a few. This imbalance causes models to overfit, resulting in poor performance. In this study, we compare several oversampling techniques, namely synthetic minority oversampling technique (SMOTE), support vector machine SMOTE (SVM-SMOTE), Borderline-SMOTE, K-means SMOTE, and adaptive synthetic (ADASYN) oversampling, to address the issue of imbalanced datasets and enhance the performance of machine learning models. Preprocessing significantly…
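To make the core idea behind the compared techniques concrete, the following is a minimal sketch of basic SMOTE-style interpolation, written in plain Python. It is not the authors' implementation and does not cover the SVM-SMOTE, Borderline-SMOTE, K-means SMOTE, or ADASYN variants. The function name `smote`, its parameters, and the toy data are all assumptions made for illustration.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Illustrative sketch only: hypothetical helper, not the paper's code.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy two-feature data (assumed for illustration): 20 majority vs 4 minority samples.
majority = [(float(i), float(i % 7)) for i in range(20)]
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5)]

# Oversample the minority class until both classes have the same size.
new_points = smote(minority, n_new=len(majority) - len(minority), k=3)
balanced_minority = minority + new_points
```

Each synthetic point lies on a line segment between two existing minority samples, so the new data stays inside the region the minority class already occupies. In practice, oversampling is usually applied to the training split only, so that synthetic points do not leak into evaluation data.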

Citation impact

Total citations: 142

FWCI: 44.69
Percentile: 100%
References: 50

Authors: 7

Topics & keywords

Computer science
Oversampling
Computational Science and Engineering
Feature (linguistics)
Feature engineering
Machine learning
Artificial intelligence
Science and engineering

Funding

Prince Sultan University
Award: 11586