New 'AI scientists' are improving, but reveal their fundamental limits

Two new AI systems—Robin and Co-Scientist—are reshaping how scientists discover drugs and test hypotheses, but their creators are honest about what these digital minds cannot do alone. Robin, developed by the nonprofit Future House, and Co-Scientist, built by Google DeepMind, represent a pragmatic approach to artificial intelligence in science: tools that amplify human expertise rather than replace it.

Both systems harness large language models to let researchers interact naturally with the vast ocean of scientific literature. But they diverge in their architecture and application. Co-Scientist, described in a paper just published in Nature, uses "multi-agent" AI—a collection of specialized agents coordinating toward a shared goal. It includes a "reflection agent" that functions as a critical peer reviewer assessing hypothesis quality, and "ranking agents" that debate research proposals in tournaments, with multiple language models simulating discussions about which ideas hold the most promise. Robin takes a more focused path, with agents specifically tuned to drug repurposing: one selects promising experimental tests, another wrestles with complex biomedical data.

The practical results offer both encouragement and humility. When Co-Scientist tackled acute myeloid leukemia, it identified 30 drug candidates worth investigating. Human oncologists refined the list, and five drugs made it to the laboratory. Three showed positive results, and one demonstrated particular promise. For dry age-related macular degeneration, Robin proposed 30 candidates; the top five were selected for testing. Co-Scientist also explored combinations of multiple drugs, suggesting these systems can navigate complexity that might overwhelm individual researchers.

What Co-Scientist cannot do—what neither system can do—is validate ideas through real physical experiments. Both Robin and Co-Scientist stop before the laboratory door. They generate and refine hypotheses, rank them by novelty and impact using methods borrowed from chess rankings, and suggest which experiments to run. But they rely entirely on human scientists to ask the right questions, reality-check the predictions, and decide which proposals warrant the investment of time and resources in actual testing.
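The chess-style ranking mentioned above can be illustrated with a small sketch. It assumes the method is an Elo-style rating updated after pairwise comparisons between hypotheses; the starting rating, the K-factor, the round-robin pairing, and the `judge` function (a hypothetical stand-in for a language-model debate) are illustrative assumptions, not details taken from either system.

```python
import itertools

def expected_score(rating_a, rating_b):
    """Standard Elo expectation: probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return updated (A, B) ratings; score_a is 1.0 if A won, 0.0 if A lost."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

def run_tournament(hypotheses, judge, start=1200.0, k=32.0):
    """Round-robin tournament: every pair is compared once, best-rated first."""
    ratings = {h: start for h in hypotheses}
    for a, b in itertools.combinations(hypotheses, 2):
        score_a = 1.0 if judge(a, b) == a else 0.0
        ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a, k)
    return sorted(ratings, key=ratings.get, reverse=True)

# Hypothetical toy data: in a real system the winner of each comparison
# would come from a simulated debate, not from a fixed lookup table.
toy_quality = {"hypothesis_A": 3, "hypothesis_B": 2, "hypothesis_C": 1}

def judge(a, b):
    """Toy judge (assumption): picks whichever hypothesis has the higher score."""
    return a if toy_quality[a] >= toy_quality[b] else b

ranking = run_tournament(list(toy_quality), judge)
print(ranking)
```

A rating of this kind reflects only how often one idea wins comparisons against the others. It says nothing about whether an idea would survive a laboratory test, which is why the human reality check still matters.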

This dependency is not weakness; it is realism. The systems thrive in the realm of pattern recognition and reasoning across published knowledge. Yet science's hardest problems demand that this synthesis be combined with judgment: the scientist's intuition, honed by years of experience. When Robin's agents proposed experiments, human scientists overrode several suggestions, and those course corrections likely steered the work toward more fruitful ground.

One notable gap remained in the published work: Co-Scientist's predictions were not compared against the specialized computational and machine learning methods that have been refined over decades of drug-discovery research. No one yet knows whether these general-purpose AI tools outperform the narrowly engineered approaches they might one day replace.

The emergence of Robin and Co-Scientist comes as other organizations race to automate entire scientific processes. The Agents4Science conference at Stanford last October showcased AI systems generating papers across mechanical engineering and protein design, alongside cautionary tales like BadScientist, which deliberately produced convincing but fundamentally unsound research. Those warning signs matter. Studies have documented that as AI-generated papers increase in quantity, their quality often declines, and that some contain fabricated references and misleading images.

What Robin and Co-Scientist demonstrate is a middle path: AI as collaborator, not author. They accelerate the discovery process by handling the knowledge synthesis that would otherwise consume weeks or months. But they preserve the human scientist as the final arbiter of what is worth investigating and what the results mean.
