Predicting disease risks from highly imbalanced data using random forest

Khalilia, Mohammed; Chakraborty, Sounak; Popescu, Mihail

doi:10.1186/1472-6947-11-51

Article · BMC Medical Informatics and Decision Making · Jul 29, 2011 · Gold OA

University of Missouri · University of Missouri Health System

Indexed in: Crossref, DOAJ, PubMed

Abstract

Background

We present a method that uses the Healthcare Cost and Utilization Project (HCUP) dataset to predict the disease risk of individuals based on their medical diagnosis history. The methodology may be incorporated into a variety of applications, such as risk management, tailored health communication and decision support systems in healthcare.

Methods

We employed the National Inpatient Sample (NIS) data, which is publicly available through the Healthcare Cost and Utilization Project (HCUP), to train random forest (RF) classifiers for disease prediction. Because the HCUP data is highly imbalanced, we employed an ensemble learning approach based on repeated random sub-sampling. This technique divides the training data into multiple sub-samples while ensuring that each sub-sample is fully balanced. We compared the performance of support vector machine (SVM), bagging, boosting and RF classifiers in predicting the risk of eight chronic diseases.
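The repeated random sub-sampling idea can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes binary labels (1 for the rare disease class, 0 otherwise), keeps every minority record in each sub-sample, and draws an equally sized random set of majority records. The function names `balanced_subsamples` and `majority_vote`, the toy labels, and the simple voting rule are all hypothetical choices made for illustration. The abstract does not specify these details.

```python
import random
from collections import Counter


def balanced_subsamples(labels, n_subsamples, seed=0):
    """Return n_subsamples index lists, each with equal counts of both classes.

    Assumption: binary labels, 1 = minority (disease), 0 = majority.
    Each sub-sample keeps all minority indices and adds an equally sized
    random draw (without replacement) from the majority indices.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    if not minority or len(minority) > len(majority):
        raise ValueError("expected a non-empty minority class labelled 1")
    subsamples = []
    for _ in range(n_subsamples):
        drawn = rng.sample(majority, len(minority))
        subsamples.append(sorted(minority + drawn))
    return subsamples


def majority_vote(predictions):
    """Combine per-model 0/1 predictions for one record; ties go to 0."""
    counts = Counter(predictions)
    return 1 if counts[1] > counts[0] else 0


# Toy data (hypothetical, not HCUP records): 5 positives among 100.
labels = [1] * 5 + [0] * 95
subsets = balanced_subsamples(labels, n_subsamples=10, seed=42)
```

In a full pipeline, one classifier (for example a random forest) would be trained on the records indexed by each entry of `subsets`, and the per-model predictions for a new record would then be combined, here with `majority_vote`.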

Citation impact

Total citations: 728

FWCI: 19.78
Percentile: 100%
References: 28


Keywords

Random forest
Boosting (machine learning)
Healthcare Cost and Utilization Project
Computer science
Support vector machine
Ensemble learning
Machine learning
Artificial intelligence

UN Sustainable Development Goals

Life on Land

Funding

University of Missouri