Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study
Massachusetts Institute of Technology · Brigham and Women's Hospital · +2 more institutions
Abstract
Large language models (LLMs) such as GPT-4 hold great promise as transformative tools in health care, ranging from automating administrative tasks to augmenting clinical decision making. However, these models also pose a danger of perpetuating biases and delivering incorrect medical diagnoses, which can have a direct, harmful impact on medical care. We aimed to assess whether GPT-4 encodes racial and gender biases that impact its use in health care.
Using the Azure OpenAI application interface, this model evaluation study tested whether GPT-4 encodes racial and gender biases and examined the impact of such biases on four potential applications of LLMs in the clinical domain: medical education, diagnostic reasoning, clinical plan generation, and subjective patient assessment. We conducted experiments with prompts designed to resemble typical use of GPT-4 within clinical and medical education applications. We used clinical vignettes from NEJM Healer and from published research on implicit bias in health care. GPT-4 estimates of the demographic distribution of medical conditions were compared with true US prevalence estimates. Differential diagnosis and treatment planning were evaluated across demographic groups using standard statistical tests for significance between groups.
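The comparison of model-estimated demographic distributions with reference prevalence can be illustrated with a goodness-of-fit test. The abstract does not name the specific tests used, so the following is only a minimal sketch under stated assumptions: a Pearson chi-square test on two categories, with counts and reference proportions that are hypothetical placeholders rather than results from the study. The function names `chi_square_gof` and `p_value_df1` are illustrative.

```python
import math


def chi_square_gof(observed, expected_props):
    """Pearson chi-square goodness-of-fit statistic.

    observed: counts per demographic category (e.g. from model-generated cases)
    expected_props: reference proportions per category, summing to 1
    """
    if len(observed) != len(expected_props):
        raise ValueError("observed and expected_props must have the same length")
    if not math.isclose(sum(expected_props), 1.0, abs_tol=1e-9):
        raise ValueError("expected_props must sum to 1")
    total = sum(observed)
    stat = 0.0
    for obs, prop in zip(observed, expected_props):
        expected = total * prop  # expected count under the reference distribution
        stat += (obs - expected) ** 2 / expected
    return stat


def p_value_df1(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom.

    Valid only for two categories (df = 1), where P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical example: 100 model-generated cases of one condition,
# 90 described as female and 10 as male, against an assumed 50/50 reference.
observed_counts = [90, 10]
reference_props = [0.5, 0.5]

stat = chi_square_gof(observed_counts, reference_props)
p = p_value_df1(stat)
print(f"chi-square = {stat:.2f}, p = {p:.3g}")
```

A small p-value here would indicate that the generated demographic mix differs from the reference proportions by more than sampling noise would explain. With more than two categories, the degrees of freedom change and a general chi-square survival function (for example from a statistics library) would be needed in place of `p_value_df1`.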
Citation impact
- FWCI: 15.51
- Percentile: 100%
- References: 52
Authors: 12
Topics & keywords
- Transformative learning
- Health care
- Medical diagnosis
- Medical care
- Psychology
- Medicine
- Political science
- Nursing
- Gender equality