Seven AI models vote out medical hallucinations

At Binghamton University in New York, researchers have taken aim at a problem that has plagued medical AI chatbots: the confident-sounding falsehood, or "hallucination," delivered as if it were fact. Ahmed Abdeen Hamed and Luis M. Rocha, of the School of Systems Science and Industrial Engineering in the Thomas J. Watson College of Engineering and Applied Science, have developed a verification protocol that enlists seven AI models to vote on medical diagnoses. In their tests, every answer was backed by more than one model.

The challenge was urgent. Last year, when Binghamton researchers tested OpenAI's ChatGPT on medical terminology, the chatbot proved accurate at identifying disease terms, drug names, and genetic information—but it also generated a troubling number of hallucinated details: made-up diagnoses, fictional drug interactions, false certainties. As more people turn to chatbots for initial health guidance, these errors could easily mislead someone about a rash, an insect bite, or whether a pain warrants medical attention.

Hamed and Rocha's solution leverages the diversity of open-source language models. Rather than relying on a single AI, they selected seven different large language models and required them to use retrieval-augmented generation—meaning each model had to reference an authoritative database of medical terminology before responding. Over 10,000 experiments, all seven chatbots received the same plain-language symptoms and each generated its own medical term with an official identification number. Then came the vote: Did the models agree?
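The voting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' code: the `ModelAnswer` type, the model names, and identifiers such as `TERM:0001` are placeholders invented for the example, and the retrieval step is assumed to have already run.

```python
from collections import Counter
from typing import NamedTuple


class ModelAnswer(NamedTuple):
    model: str    # hypothetical model name
    term_id: str  # placeholder for the official identification number
    term: str     # placeholder for the medical term the model proposed


def tally_votes(answers: list[ModelAnswer]) -> list[tuple[str, int]]:
    """Count how many models proposed each identifier, most supported first."""
    return Counter(a.term_id for a in answers).most_common()


# Seven made-up answers to the same plain-language symptom description.
answers = [
    ModelAnswer("model_a", "TERM:0001", "term A"),
    ModelAnswer("model_b", "TERM:0001", "term A"),
    ModelAnswer("model_c", "TERM:0002", "term B"),
    ModelAnswer("model_d", "TERM:0001", "term A"),
    ModelAnswer("model_e", "TERM:0001", "term A"),
    ModelAnswer("model_f", "TERM:0002", "term B"),
    ModelAnswer("model_g", "TERM:0001", "term A"),
]

votes = tally_votes(answers)
top_id, top_votes = votes[0]
print(f"{top_id} supported by {top_votes} of {len(answers)} models")
```

Voting on the identifier rather than on the free-text term is a choice made for this sketch: two models that phrase the same concept differently would still count as agreeing.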

The results were striking. In 76.85% of cases, at least four of the seven models supported the same answer. The remaining 23.15% were backed by at least two models. Critically, no unmatched terms appeared, which by the protocol's measure means no hallucinations. In these experiments, the protocol removed the false confidence that makes AI-generated misinformation particularly dangerous.
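To make the arithmetic behind those percentages concrete, the sketch below sorts each case by how many models backed its top answer and reports the share of cases in each group. The thresholds of four and two come from the figures above; the group names (`majority`, `partial`, `unmatched`) and the sample data are assumptions made for illustration only.

```python
from collections import Counter


def support_level(top_votes: int, majority: int = 4) -> str:
    """Label a case by how many of the seven models backed its top answer."""
    if top_votes >= majority:
        return "majority"   # at least four of seven agree
    if top_votes >= 2:
        return "partial"    # at least two models agree
    return "unmatched"      # no second model supports the answer


def level_shares(top_votes_per_case: list[int]) -> dict[str, float]:
    """Return the percentage of cases that fall into each support level."""
    levels = Counter(support_level(v) for v in top_votes_per_case)
    total = len(top_votes_per_case)
    return {
        name: 100 * levels[name] / total
        for name in ("majority", "partial", "unmatched")
    }


# Made-up top-vote counts for eight cases, not data from the study.
sample = [7, 5, 4, 4, 6, 4, 3, 2]
shares = level_shares(sample)
print(shares)
```

In the study's terms, the first two groups together accounted for every case, so the third group was empty.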

"The new workflow is incredible," Hamed said, "because it can verify anything from a biomedical point of view—biological knowledge with disease and genetics, translational knowledge from diseases to treatments and clinical trials, and also from a health care point of view with symptoms and treatments." Their findings were published in the journal STAR Protocols.

What makes this approach unusually robust is its reproducibility. With over 100 open-source language models available, researchers can run the same experiment with seven randomly selected models each time, strengthening confidence in the voting pattern with every iteration. Because the protocol is not tied to any particular set of models, it can be repeated with different panels rather than being locked into one version.
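Drawing a fresh panel for each run is simple to sketch. The example below assumes a pool of 100 placeholder model names and uses Python's standard `random` module; the `pick_panel` helper and the seed parameter are illustrative additions, not part of the published protocol.

```python
import random
from typing import Optional


def pick_panel(pool: list[str], size: int = 7, seed: Optional[int] = None) -> list[str]:
    """Draw `size` distinct models from the pool without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(pool, size)


# Placeholder names standing in for the available open-source models.
pool = [f"model_{i:03d}" for i in range(100)]

panel = pick_panel(pool, seed=1)
print(panel)
```

Recording the seed alongside the results would let another team reconstruct exactly which seven models voted in a given run.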

The implications extend far beyond diagnosis. Rocha, who directs the Complex Adaptive Systems and Computational Intelligence Lab at Binghamton, sees applications in "digital twins" for precision medicine: dynamic virtual replicas of physical processes that could help health care providers optimize outcomes before testing on patients. The protocol could also extract evidence about adverse drug reactions from clinical trials, scientific literature, pharmacological databases, and even social media discourse. Hamed noted that this verification method could curb other kinds of AI hallucinations too: fabricated legal citations, fake academic references, or invented historical details.

The breakthrough represents a practical step toward trustworthy AI in medicine—not by making chatbots smarter in isolation, but by making them accountable to each other. As AI becomes woven into health decisions, this democratic approach to verification offers a path toward answers people can actually rely on.
