Multilingual benchmark evaluates how well AI interprets clinical text and health records in nine languages

Jie Yang remembers the moment her team realized most AI couldn’t read a real doctor’s note. At Mass General Brigham in Boston, they had just tested one of the world’s top-performing large language models—one that aced medical licensing exams with a score of 92—on actual clinical text from electronic health records. It scored just 44.8%. That gap sparked BRIDGE, a new multilingual benchmark that’s reshaping how we measure AI’s readiness for real patient care. Unlike traditional assessments built on clean, textbook-style questions, BRIDGE uses messy, real-world clinical data: doctor-patient conversations, EHR entries, and case reports that reflect how medicine is actually practiced. The goal isn’t just to test AI—it’s to make sure it works where it matters most: at the bedside.

The stakes are high. As hospitals increasingly turn to AI for tasks like diagnosis, triage, and billing, tools trained on oversimplified language risk misreading symptoms, misclassifying conditions, or missing critical nuances in patient history. BRIDGE evaluates performance across 14 clinical specialties and five stages of care, from initial patient contact to post-treatment coding. In testing 95 LLMs on text drawn from 59 clinical sources, the team found wide performance disparities, not just between models but also across languages and specialties. This is where BRIDGE’s multilingual design matters most: it includes clinical text in nine languages, from Spanish and Mandarin to Arabic and Portuguese, exposing gaps in how well AI serves non-English-speaking patients.

One of the most revealing findings? The same model that excels in oncology might falter in psychiatry or pediatrics. These variations matter deeply in clinical settings where precision is non-negotiable. To bring transparency, Yang and her team—co-led by Dr. Joshua Lin, with co-first authors Jiageng Wu and Bowen Gu—launched a public leaderboard. Now tracking 107 models and growing, it allows clinicians and developers to compare AI performance on real clinical tasks, not just exam scores. The leaderboard isn’t static; it updates as new models emerge, creating a living standard for medical AI.

BRIDGE doesn’t just expose weaknesses—it lights a path forward. By pinpointing where models fail, especially in underrepresented languages and specialties, it guides developers toward more equitable, accurate tools. For hospitals considering AI integration, BRIDGE offers a practical way to choose systems that truly understand clinical reality. As AI becomes embedded in global health, benchmarks like BRIDGE ensure that progress isn’t measured by exam scores alone, but by how well technology serves every patient, in their own language, and in the complexity of real care.