The Performance of DeepSeek R1 and Gemini 3 in Complex Medical Scenarios: Comparative Study

Bajwa, Maria; Hoyt, Robert; Knight, Dacre; Haider, Maruf

doi:10.2196/76822

Article · JMIRx Med · Mar 23, 2026 · Gold OA

MGH Institute of Health Professions · Virginia Commonwealth University · +2 more institutions

Indexed in: Crossref, DOAJ, PubMed

Abstract

Background

Generative artificial intelligence models, especially reasoning large language models (LLMs), are gaining adoption in health care for diagnostic decision support and medical education. DeepSeek R1 is a reasoning LLM that generates extended chain-of-thought explanations to make its decision-making process more explicit. Traditional medical benchmarks often lack complexity and authenticity, motivating the adoption of scenario-rich datasets, such as the Massive Multitask Language Understanding Pro (MMLU-Pro) professional medicine subset, which provides multispecialty clinical vignettes for reasoning-centric evaluation.

Objective

This study aimed to assess the diagnostic accuracy, reasoning quality, reasoning transparency, and practical usability of DeepSeek R1 and Gemini 3 Pro across closed- and open-ended clinical scenarios, in order to guide their future use in clinical education and training. The evaluation analyzed 162 diverse medical scenarios (both closed- and open-ended) from the MMLU-Pro health subset.

Citation impact

Total citations: 4

FWCI: 81.57
Percentile: 99%
References: 22


Topics & keywords

Keywords

Perspective (graphical)
Identification (biology)
Variety (cybernetics)
Set (abstract data type)
Feature (linguistics)
