Matthew McDermott kept running into the same fundamental problem: researchers trying to develop lifesaving AI models were losing months of work to the same basic data translation, over and over again. Now, his team at Columbia University has built something designed to change that. They've created MEDS, an open-source framework that translates the messy, institution-specific languages of hospital data into a form machine learning models can readily use, no matter which clinic or EHR system the information came from.

The challenge MEDS solves is deceptively simple to state but has plagued health AI research for years. Electronic health records are stored in wildly different formats depending on whether they come from Hospital A, Clinic B, or software vendor C. Before researchers could even begin training a model, they'd need to spend weeks or months writing custom code to transform each dataset into something usable. This duplication meant fewer eyes on the actual science, harder collaboration across institutions, and findings that were difficult for others to reproduce. It's the kind of infrastructure problem that doesn't make headlines but quietly holds back progress.

MEDS addresses this by providing a lightweight, standardized format for representing clinical data in machine learning workflows, combined with open-source tools for data transformation, preprocessing, and benchmarking. "MEDS is a simple way to make all different sources of electronic health record data look the same to your code, regardless of what hospital or clinic or EHR software system the data came from," McDermott explains. The real power lies in what this enables: researchers can now share code and train models across multiple sites without needing to share sensitive patient data or perform the painstaking work of forcing all data into identical clinical vocabularies.
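To make the idea concrete, here is a minimal Python sketch of what "looking the same to your code" can mean in practice. It is an illustration only, not the MEDS library or its official schema: the `Event` fields, the two source layouts, the `LAB//` code prefix, and the helper names `from_hospital_a`, `from_clinic_b`, and `mean_value` are all assumptions made up for this example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One clinical observation in a shared, source-independent shape.

    The field names are illustrative assumptions, not the MEDS specification.
    """
    subject_id: int
    time: datetime
    code: str
    numeric_value: Optional[float] = None


def from_hospital_a(row: dict) -> Event:
    # Hypothetical Hospital A export: ISO timestamps, lab name and value in separate keys.
    return Event(
        subject_id=int(row["patient_id"]),
        time=datetime.fromisoformat(row["charted_at"]),
        code=f"LAB//{row['lab_name']}",
        numeric_value=float(row["value"]),
    )


def from_clinic_b(row: dict) -> Event:
    # Hypothetical Clinic B export: US-style dates, "test=result" packed into one string.
    name, _, result = row["obs"].partition("=")
    return Event(
        subject_id=int(row["mrn"]),
        time=datetime.strptime(row["date"], "%m/%d/%Y %H:%M"),
        code=f"LAB//{name}",
        numeric_value=float(result),
    )


def mean_value(events: list, code: str) -> float:
    # Downstream analysis code only ever sees Event objects, never the raw source layout.
    values = [e.numeric_value for e in events
              if e.code == code and e.numeric_value is not None]
    return sum(values) / len(values)


hospital_a_rows = [{"patient_id": "17", "charted_at": "2024-03-01T08:30:00",
                    "lab_name": "creatinine", "value": "1.0"}]
clinic_b_rows = [{"mrn": "902", "date": "03/02/2024 14:15", "obs": "creatinine=2.0"}]

events = ([from_hospital_a(r) for r in hospital_a_rows]
          + [from_clinic_b(r) for r in clinic_b_rows])
print(mean_value(events, "LAB//creatinine"))
```

Each site writes its small adapter once, and everything after that point, here the `mean_value` function, can be shared unchanged across sites without moving the underlying patient records.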

The framework was designed specifically with AI and machine learning applications in mind, and it complements rather than replaces the clinical data standards already in use. It's built to support a broad range of biomedical research, from predictive modeling to representation learning to large-scale benchmarking studies that might involve datasets too massive to manage with older tools.

What's particularly striking is how quickly MEDS has been adopted. The framework has already been embraced across 21 institutions spanning 12 countries, suggesting researchers recognize the problem it solves and are eager to spend their time on research rather than plumbing. Because the entire ecosystem is open-source, academics, healthcare systems, and industry researchers can all contribute tools and extensions, creating a collaborative network effect.

McDermott emphasizes that the biggest breakthroughs in AI have always come from communities working together on shared infrastructure. "These impressive results in MEDS are just reflecting the benefits you get when the community can share tools or abstract common parts of their pipelines out into a shared library and use them across everyone's data," he says. As machine learning models increasingly move toward actual clinical deployment, reproducibility and transparency have only become more important. MEDS supports both: researchers can now publish models and code that others can actually run on their own data, with the framework handling the translation.

The work was published in NEJM AI, a signal of how seriously the medical AI community is taking this infrastructure challenge. For a field hungry to accelerate progress, removing friction around data standardization isn't glamorous. But it might be exactly the kind of humble, well-built tool that lets researchers spend less time on busywork and more time answering the questions that matter.