Comparison of the effects of imputation methods for missing data in predictive modelling of cohort study datasets

Li, Jiahang; Guo, Shuxia; Ma, Rulin; He, Jia; Zhang, Xianghui; Rui, Dongsheng; Ding, Yusong; Li, Yu; Jian, Le-yao; Cheng, Jing; Guo, Heng

doi:10.1186/s12874-024-02173-x

Article · BMC Medical Research Methodology · Feb 16, 2024 · Gold OA

Shihezi University · Xinjiang Production and Construction Corps

Indexed in: Crossref, DOAJ, PubMed

Abstract

Background

Missing data is often unavoidable in cohort studies and can adversely affect a study's findings. We assess the effectiveness of eight commonly used statistical and machine learning (ML) imputation methods for handling missing data in predictive modelling of cohort study datasets. The evaluation is based on real data and on predictive models of cardiovascular disease (CVD) risk.

Methods

The data is from a real-world cohort study in Xinjiang, China. It includes personal information, physical examination data, questionnaires, and laboratory biochemical results from 10,164 subjects, with 37 variables in total. The imputation methods compared were simple imputation (Simple), regression imputation (Regression), expectation-maximization (EM), multiple imputation (MICE), K-nearest neighbor classification (KNN), clustering imputation (Cluster), random forest (RF), and decision tree (Cart). Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) were used to assess the performance of each imputation method at a missing rate of 20%. The datasets produced by each imputation method were then used to build a CVD risk prediction model with a support vector machine (SVM), and the predictive performance of these models was compared using the area under the curve (AUC).
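The evaluation described above can be illustrated with a minimal sketch. It is not the authors' code: it assumes values are hidden completely at random, treats "Simple" imputation as mean imputation, and uses synthetic numbers in place of the cohort data. The helper names (`mask_values`, `mean_impute`, `rmse`, `mae`, `auc`) are hypothetical. RMSE and MAE are computed only on the hidden cells, and AUC is computed as the probability that a positive case receives a higher score than a negative one.

```python
import math
import random


def mask_values(column, missing_rate, rng):
    """Hide a random fraction of entries (set to None).

    Returns the masked copy and the sorted indices that were hidden.
    Assumption: values are missing completely at random.
    """
    n_missing = round(len(column) * missing_rate)
    hidden = sorted(rng.sample(range(len(column)), n_missing))
    masked = list(column)
    for i in hidden:
        masked[i] = None
    return masked, hidden


def mean_impute(masked):
    """Fill None entries with the mean of the observed entries."""
    observed = [v for v in masked if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in masked]


def rmse(true, pred):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true))


def mae(true, pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)


def auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(0)
# Synthetic stand-in for one continuous variable (not the cohort data).
complete = [rng.gauss(120.0, 15.0) for _ in range(1000)]
masked, hidden = mask_values(complete, 0.20, rng)  # 20% missing rate
imputed = mean_impute(masked)

# Imputation error is measured only on the cells that were hidden.
truth = [complete[i] for i in hidden]
filled = [imputed[i] for i in hidden]
print(f"RMSE={rmse(truth, filled):.2f}  MAE={mae(truth, filled):.2f}")
```

Swapping `mean_impute` for another imputer while keeping the same mask lets the methods be compared on identical hidden cells. The `auc` helper would then be applied to the scores of a classifier trained on each imputed dataset.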

Citation impact

Total citations: 114

FWCI: 70.88
Percentile: 100%
References: 32

Authors: 11

Keywords

Imputation (statistics)
Missing data
Statistics
Mean squared error
Random forest
Cart
Regression
Cluster analysis
