The Performance of DeepSeek R1 and Gemini 3 in Complex Medical Scenarios: Comparative Study
MGH Institute of Health Professions · Virginia Commonwealth University · +2 more institutions
Abstract
Generative artificial intelligence models, especially reasoning large language models (LLMs), are gaining adoption in health care for diagnostic decision support and medical education. DeepSeek R1 is a reasoning LLM that generates extended chain-of-thought explanations to make its decision-making process more explicit. Traditional medical benchmarks often lack complexity and authenticity, motivating the adoption of scenario-rich datasets, such as the Massive Multitask Language Understanding Pro (MMLU-Pro) professional medicine subset, which provides multispecialty clinical vignettes for reasoning-centric evaluation.
This study aimed to assess the diagnostic accuracy, reasoning quality, reasoning transparency, and practical usability of DeepSeek R1 and Gemini 3 Pro across closed- and open-ended clinical scenarios, in order to guide their prospective use in clinical education and training. The evaluation analyzed 162 diverse medical scenarios (both closed- and open-ended) from the MMLU-Pro health subset.
Citation impact
- FWCI: 81.57
- Percentile: 99%
- References: 22
Authors
- 4
Topics & keywords
- Perspective (graphical)
- Identification (biology)
- Variety (cybernetics)
- Set (abstract data type)
- Feature (linguistics)