
Calling Doctor GPT: AI responses to health care queries are nearly 76% accurate

Penn State researchers held an unusual competition on campus last year: a Diagnose-a-thon where 34 faculty, staff, and students submitted 212 prompts asking AI chatbots everyday health questions, the kind millions of people type into ChatGPT or Gemini every day instead of booking a doctor's appointment. What they discovered was both reassuring and cautionary: artificial intelligence responds to common health concerns with nearly 76% accuracy, which sounds promising until you consider that it leaves roughly one answer in four wrong, a high error rate for medical advice.

The work, led by associate professor Amulya Yadav at Penn State's College of Information Sciences and Technology, tackles a blind spot in AI research. While previous studies have examined how large language models handle health information in academic settings, few have focused on how average internet users actually deploy these tools—using them as symptom checkers, the way people once googled their ailments. Understanding real-world accuracy matters because the stakes are personal and sometimes serious.

The researchers asked participants to use any of four leading models—ChatGPT-4o, ChatGPT-3.5, Gemini-1.5 Pro, and Llama3-8b—and compose prompts as naturally as they would on an ordinary day. This participatory approach deliberately mimicked authentic usage, a strength that doctoral candidate Bonam Mingole emphasized: "We're essentially trying to replicate real-world usage of LLMs by telling participants to choose the LLM of their choice and use it as they would on a normal day."

Nine board-certified physicians then evaluated each response on two dimensions—accuracy and potential for harm—using a detailed six-point scale. The results revealed sharp specialty-level differences. Obstetrics and gynecology, along with otolaryngology (ear, nose, and throat medicine), produced the most reliable AI guidance, with high validity scores and minimal harm risk. But internal medicine, neurology, and dermatology emerged as danger zones: the AI performed poorly, with low validity scores and elevated potential for harm. Notably, the researchers also discovered that very specific prompts—and those between 60 and 250 characters in length—consistently yielded more accurate responses.

The team then pushed the investigation a step further, augmenting the base models with training on medical textbooks, clinical guidelines, and peer-reviewed research articles drawn from medical school curricula. Medical professionals reassessed the resulting responses to see whether this specialized knowledge improved accuracy and reduced harm.

"Our work focuses explicitly on health care scenarios that the average internet user might ask AI," Yadav explained, "which is a perspective that prior research into large language models and health care hasn't covered." The practical implication: in specialized areas like neurology and dermatology, these tools may work best in the hands of physicians rather than as a direct source of advice for patients.

The findings will be presented at the 2026 Association for Computing Machinery Fairness, Accountability and Transparency conference in Montreal in June, offering the research community a timely snapshot of where AI health tools stand and where they remain dangerously unreliable.
