Fifty physicians across three leading hospitals sat down with complex medical cases and a question: could ChatGPT Plus improve their diagnostic accuracy? The answer, from researchers at UVA Health, Stanford, and Harvard's Beth Israel Deaconess Medical Center, is more nuanced than a simple yes or no — and it suggests that the future of AI in medicine may depend less on the technology itself than on how doctors learn to use it.
The randomized controlled trial, led by UVA Health's Dr. Andrew S. Parsons, MPH, asked family medicine, internal medicine, and emergency medicine physicians to diagnose "clinical vignettes" based on real patient cases. Half the doctors used ChatGPT Plus alongside their work, while the other half relied on conventional resources like UpToDate and Google. The researchers scored accuracy and timed how quickly each group reached their conclusions.
The results were unexpected. Physicians using ChatGPT Plus achieved a median diagnostic accuracy of 76.3%, compared with 73.7% for those using traditional methods, a gap of 2.6 percentage points that left the two groups statistically similar. The AI-assisted group did work faster, reaching diagnoses in a median of 519 seconds versus 565 seconds for the conventional group, a difference of 46 seconds. The most surprising result was that ChatGPT alone, without any human physician involved, achieved a median diagnostic accuracy exceeding 92%.
"We were surprised to find that adding a human physician to the mix actually reduced diagnostic accuracy though improved efficiency," Parsons said. "These results likely mean that we need formal training in how best to use AI."
The finding turns conventional expectations on their head. Rather than demonstrating that AI augments human expertise, the study suggests that doctors, at least those without training in AI tools, may actually interfere with the technology's performance. The researchers theorize that ChatGPT's strong performance on its own reflects the particularly well-designed prompts used in the study, which suggests that physicians would benefit from formal instruction in how to structure requests and interact with large language models effectively. Alternatively, healthcare organizations might purchase pre-designed prompts optimized for clinical use.
The research also includes an important caveat: the clinical vignettes, while based on real cases, represented a controlled environment far removed from actual medical practice. Real-world diagnosis involves countless considerations beyond pattern recognition — weighing downstream effects of treatment decisions, managing competing medical concerns, and navigating the irreducible complexity of individual patients. The researchers note that ChatGPT Plus would likely struggle with these dimensions of clinical reasoning, and they're already conducting follow-up studies to examine how large language models perform on management and decision-making.
What emerges from this work is not a threat to physicians but rather a call for partnership — one that requires intentional effort to develop. As AI becomes increasingly embedded in healthcare systems, understanding how to maximize its benefits while preserving the judgment that only human doctors bring becomes essential. The four study sites have launched ARiSE, a bi-coastal AI evaluation network, to continue this work and assess generative AI outputs across healthcare more broadly.
For now, the message from Parsons and his colleagues is clear: ChatGPT is best used to augment, not replace, human physicians. But realizing that potential requires that doctors learn to work with these tools as deliberately and skillfully as they learned medicine itself.