At the Icahn School of Medicine at Mount Sinai, researchers have built an AI model that thinks about genes the way ChatGPT thinks about words—understanding that meaning emerges from context. The breakthrough, published in the journal Patterns, introduces a gene set foundation model, or GSFM, that learns how genes function together by analyzing millions of gene groupings from published research and datasets, offering scientists a powerful new lens for understanding disease and discovering treatments.
The work springs from a simple but profound insight: genes rarely act alone. A single gene can behave differently depending on where and when it activates in the cell, much like a word shifts meaning from sentence to sentence. "Genes participate in multiple biological processes, forming different molecular groupings depending on where and when they are active in the cell," explains Avi Ma'ayan, Ph.D., Professor of Pharmacological Sciences and Director of the Mount Sinai Center for Bioinformatics. "Just as modern language models learn the meaning of words from context, we asked whether AI could learn the 'meaning' of genes in the same way."
To train the GSFM, Ma'ayan and his team compiled millions of gene sets from hundreds of thousands of independent research efforts, a large and diverse collection drawn from published scientific studies and gene expression datasets. They framed training as a puzzle: the model was given part of a gene set and asked to predict the missing genes. Over time, the system learned the underlying patterns that describe how genes are grouped and how they interact across different biological contexts.
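The puzzle-style task can be illustrated with a toy example. The sketch below is not the GSFM itself and makes no claim about its architecture: it swaps in a simple co-occurrence counter for the neural model, and the function names (`mask_gene_set`, `fit_cooccurrence`, `predict_missing`) and placeholder gene labels are hypothetical, chosen only for illustration.

```python
import random
from collections import Counter, defaultdict
from itertools import combinations


def mask_gene_set(gene_set, mask_fraction=0.5, rng=None):
    """Split one gene set into a visible part and a hidden part (the 'puzzle')."""
    rng = rng or random.Random(0)
    genes = sorted(gene_set)
    rng.shuffle(genes)
    n_hidden = max(1, int(len(genes) * mask_fraction))
    return set(genes[n_hidden:]), set(genes[:n_hidden])


def fit_cooccurrence(gene_sets):
    """Count how often each pair of genes appears in the same set.

    Assumption: a stand-in for model training, not the published method.
    """
    counts = defaultdict(Counter)
    for gene_set in gene_sets:
        for a, b in combinations(sorted(gene_set), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def predict_missing(counts, visible, top_k=3):
    """Rank genes outside the visible part by total co-occurrence with it."""
    scores = Counter()
    for gene in visible:
        for other, n in counts.get(gene, {}).items():
            if other not in visible:
                scores[other] += n
    # Sort by score (descending), then by name, so ties are deterministic.
    return sorted(scores, key=lambda g: (-scores[g], g))[:top_k]


if __name__ == "__main__":
    # Placeholder labels, not real gene symbols.
    training_sets = [
        {"GENE_A", "GENE_B", "GENE_C"},
        {"GENE_A", "GENE_B", "GENE_C", "GENE_D"},
        {"GENE_X", "GENE_Y"},
    ]
    counts = fit_cooccurrence(training_sets)
    print(predict_missing(counts, {"GENE_A", "GENE_B"}, top_k=2))
```

A foundation model replaces the counting step with learned representations, but the shape of the task is the same: part of a set goes in, ranked candidates for the rest come out.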
The payoff is substantial. By mapping gene relationships across many situations, the model creates a reference framework that helps scientists interpret complex data more effectively. The GSFM can suggest functions for poorly understood genes before any lab work is done, highlight genes involved in disease, suggest potential drug targets and biomarkers, and provide a reusable knowledge system for many types of biomedical research. Unlike previous biological AI models that rely primarily on gene expression data, the GSFM is trained on gene sets, a different and largely underused type of biological information that allows it to integrate diverse data from many diseases, experimental methods, and research conditions.
To test its accuracy, researchers trained the model using gene sets from publications up to a cutoff date, then tested whether it could predict discoveries reported in studies published after that date. It succeeded, demonstrating the ability to identify gene-gene and gene-function relationships before they were confirmed experimentally—a validation that this new "map" of cellular gene organization actually works.
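The logic of this time-split test can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the record layout (`published`, `genes`), the function names, and the choice of recall over newly reported gene pairs as the score are all hypothetical, since the article does not describe the metrics the researchers actually used.

```python
from datetime import date
from itertools import combinations


def split_by_cutoff(records, cutoff):
    """Training data: published up to the cutoff. Test data: published after it."""
    train = [r for r in records if r["published"] <= cutoff]
    test = [r for r in records if r["published"] > cutoff]
    return train, test


def gene_pairs(records):
    """Collect every unordered gene pair that shares a gene set."""
    pairs = set()
    for record in records:
        for pair in combinations(sorted(record["genes"]), 2):
            pairs.add(frozenset(pair))
    return pairs


def novel_pair_recall(predicted_pairs, train_records, test_records):
    """Fraction of pairs first seen after the cutoff that were predicted.

    Assumption: an illustrative score, not the paper's evaluation metric.
    """
    known = gene_pairs(train_records)
    novel = gene_pairs(test_records) - known
    if not novel:
        return 0.0
    return len(novel & set(predicted_pairs)) / len(novel)


if __name__ == "__main__":
    # Placeholder records with made-up dates and gene labels.
    records = [
        {"published": date(2020, 1, 1), "genes": {"G1", "G2"}},
        {"published": date(2021, 6, 1), "genes": {"G2", "G3"}},
        {"published": date(2023, 3, 1), "genes": {"G1", "G3"}},
        {"published": date(2023, 9, 1), "genes": {"G3", "G4"}},
    ]
    train, test = split_by_cutoff(records, date(2022, 1, 1))
    predicted = {frozenset({"G1", "G3"})}
    print(novel_pair_recall(predicted, train, test))
```

The key design point is that pairs already present before the cutoff are excluded from the score, so the model gets credit only for relationships it could not have simply memorized.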
One immediate application lies in gene set enrichment analysis, a widely used method in molecular biology. By improving how scientists interpret gene groupings, the model may uncover new biological insights from both existing and future datasets. The team plans to expand the system further, combining the GSFM with other AI foundation models to generate natural-language explanations of gene functions and eventually integrating it with drug-focused AI models to predict how drugs interact with cells.
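For readers unfamiliar with enrichment analysis, one common form is an over-representation test: given a list of genes from an experiment, it asks whether a known gene set overlaps that list more than chance would predict. The sketch below computes the standard hypergeometric tail probability using only the Python standard library. It illustrates the conventional method, not how the GSFM modifies it, and the function name and placeholder gene labels are assumptions made for illustration.

```python
from math import comb


def hypergeom_enrichment_p(query, gene_set, background_size):
    """P-value for seeing at least the observed overlap by chance.

    Assumes both `query` and `gene_set` are drawn from the same background
    of `background_size` genes.
    """
    query, gene_set = set(query), set(gene_set)
    n = len(query)                 # genes in the experimental list
    K = len(gene_set)              # genes in the annotated set
    k = len(query & gene_set)      # observed overlap
    N = background_size
    if N < max(n, K):
        raise ValueError("background must be at least as large as each set")
    # Sum the hypergeometric probabilities for overlaps of size k or larger.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / comb(N, n)


if __name__ == "__main__":
    # Placeholder labels, not real gene symbols.
    annotated = {"G1", "G2", "G3"}
    print(hypergeom_enrichment_p({"G1", "G2", "G3"}, annotated, background_size=10))
    print(hypergeom_enrichment_p({"G4", "G5", "G6"}, annotated, background_size=10))
```

A small p-value means the overlap is unlikely to be a coincidence. In practice, the test is repeated across many gene sets and the results are corrected for multiple comparisons.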
For the broader scientific community, the significance is clear: understanding how genes work together in different contexts has long remained one of biology's major unsolved questions. This new tool does more than address that question: it creates a shared reference system that thousands of researchers can use to accelerate discovery across diseases and drug development.
