AI Scientists Are Coming. What Happens When Machines Start Doing Science?

In one simulated experiment, an AI system independently discovered the fundamental symmetries that keep certain theories of gravity mathematically consistent—without being told what to look for. It didn’t just calculate; it reasoned, critiqued, and converged on a deep physical insight once thought to require human genius (Braga et al., 2026). This wasn’t a fluke. It was a prototype of what the authors call an “AI scientist”: not a tool, but an epistemic actor—a system capable of generating, testing, and refining scientific hypotheses at machine speed and scale.

We are no longer just using AI to analyze data or write code. We are building multi-agent systems that function like algorithmic research teams, capable of traversing scientific model spaces so vast that no human could explore them in a lifetime. The implications are profound: if AI can industrialize the routine work of science—hypothesis generation, literature synthesis, model criticism—then the role of human scientists must evolve. Not because they’ll be replaced, but because their most valuable contribution will shift from exploring known possibilities to creating new ones.

This isn’t science fiction. It’s already happening in cosmology, genomics, and materials science. And yet, our institutions—peer review, funding agencies, journals, universities—are still structured for a world where discovery is slow, human, and linear. They are not ready for AI systems that can draft 100 plausible papers a day, some of which may contain real breakthroughs buried under plausible nonsense. The risk isn’t just that AI will flood the scientific literature with hallucinated results. It’s that we’ll lose the capacity to recognize truth when it appears.

The authors of this paper—Raul Jimenez, Boris Bolliet, Francisco Villaescusa-Navarro, and colleagues—don’t just describe this transformation. They argue it’s qualitative, not incremental. And they issue a call: if we want AI to accelerate the search for truth, not just the production of papers, we must rebuild the institutions of science around four pillars: verification, accountability, interpretability, and dual-use safety.

The Science

The paper centers on a prototype framework called Denario, a multi-agent AI system designed to simulate the full scientific discovery cycle. Unlike a single large language model (LLM) that responds to prompts, Denario is a collaborative ecosystem of specialized AI agents, each with a defined role:

A generation agent proposes new models or interpretations.
A critic agent identifies inconsistencies, overfitting, or violations of physical constraints.
A verification agent tests predictions against data.
A controller agent orchestrates the workflow, deciding which tools to use and when.

These agents communicate through structured prompts, share computational tools, and maintain state across iterations—effectively engaging in an internal dialectic. This architecture mirrors a human research team, but operates at speeds and scales far beyond human capacity.
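The division of labor described above can be sketched in a few lines. This is a minimal toy, not Denario's actual code or API: the function names (`generation_agent`, `critic_agent`, `verification_agent`, `controller`), the `Candidate` type, and the one-parameter model y = slope * x are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Data = List[Tuple[float, float]]


@dataclass
class Candidate:
    """A toy 'model': the slope of y = slope * x (hypothetical stand-in)."""
    slope: float


def generation_agent(previous: Optional[Candidate]) -> Candidate:
    """Propose a model. Here: start at -1.0 and step the slope by 0.5."""
    if previous is None:
        return Candidate(slope=-1.0)
    return Candidate(slope=previous.slope + 0.5)


def critic_agent(candidate: Candidate) -> List[str]:
    """Return objections. Toy constraint: the slope must be non-negative."""
    if candidate.slope < 0:
        return [f"slope {candidate.slope} violates the non-negativity constraint"]
    return []


def verification_agent(candidate: Candidate, data: Data, tol: float = 1e-9) -> bool:
    """Test the candidate's predictions against every data point."""
    return all(abs(candidate.slope * x - y) <= tol for x, y in data)


def controller(data: Data, max_rounds: int = 20) -> Tuple[Optional[Candidate], List[str]]:
    """Orchestrate propose -> critique -> verify until a candidate survives."""
    log: List[str] = []
    candidate: Optional[Candidate] = None
    for round_no in range(1, max_rounds + 1):
        candidate = generation_agent(candidate)
        objections = critic_agent(candidate)
        if objections:
            log.extend(f"round {round_no}: {o}" for o in objections)
            continue  # rejected by the critic; ask for a revised proposal
        if verification_agent(candidate, data):
            log.append(f"round {round_no}: accepted slope {candidate.slope}")
            return candidate, log
        log.append(f"round {round_no}: slope {candidate.slope} fails the data")
    return None, log


# Synthetic data generated from y = 2x, invented for this example.
DATA: Data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
result, history = controller(DATA)
print(result)
for line in history:
    print(line)
```

The state kept across iterations (the previous candidate and the log) is what lets the loop refine rather than restart. A real system would replace each function with a model call or a tool invocation.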

Denario was tested in a controlled setting in theoretical cosmology: tasked with identifying the “healthy” (ghost-free) symmetries in modified gravity theories—constraints so subtle that their discovery once required deep physical intuition. The system not only found them but did so without explicit guidance, demonstrating that AI can independently uncover non-trivial scientific insights (Braga et al., 2026).

The broader vision is that such systems could be applied across data-intensive fields: cosmology, climate modeling, particle physics, genomics, and drug discovery. In each, the bottleneck is no longer data or computing power, but the cognitive load of navigating vast hypothesis spaces. Human intuition guides us to promising regions, but leaves most of the landscape unexplored. AI scientists could systematically map these spaces, identifying viable models, ruling out dead ends, and suggesting new experiments.

But the authors emphasize: this is not about replacing scientists. It’s about redefining what science is—and what it is for.

What They Found

The most striking result isn’t a number, but a shift in agency. Current AI systems assist; Denario acts. In the gravity theory experiment, the system:

Retrieved and synthesized relevant literature.
Generated candidate symmetry conditions.
Simulated their mathematical consequences.
Detected inconsistencies (e.g., ghost modes).
Iteratively refined the model until it satisfied physical consistency.

This closed-loop process—hypothesis, test, critique, revise—mirrors the scientific method, but compressed from months to hours.

More broadly, the authors identify several key trends enabled by multi-agent AI:

Hypothesis space explosion: In fields like cosmology, the number of possible inflationary models or dark energy equations of state is combinatorial. Human researchers explore a tiny fraction. AI can generate and evaluate millions.
Discovery cycle compression: Literature review, code generation, data analysis, and manuscript drafting—tasks that take weeks—can be automated end-to-end.
Emergent insight: By traversing model spaces beyond human intuition, AI systems can stumble upon novel structures. In one test, Denario proposed a symmetry condition that had not been previously documented in the literature—later verified as mathematically sound.

To illustrate the scale of this shift, consider the growth in scientific output. The number of papers using AI in research has increased 12-fold since 2015 (Elsevier, 2025). Meanwhile, the rate of retractions due to AI-generated hallucinations has risen sharply—particularly in fields like computational biology and materials science.

Growth of AI-Assisted Scientific Publications (2015–2025)

Exponential growth in scientific papers using AI tools, based on Elsevier Scopus data. Note the dip in 2025 due to retractions and policy changes.

Year	AI-assisted papers (thousands)
2015	8
2016	11
2017	16
2018	23
2019	34
2020	51
2021	76
2022	112
2023	165
2024	220
2025	96
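As a quick arithmetic check, the 12-fold figure can be recomputed from the yearly counts in this article's chart data (in thousands of papers). The sketch below uses only those values; the variable names are illustrative, and the compound annual growth rate is a derived quantity that the source does not report.

```python
# Yearly counts of AI-assisted papers (thousands), copied from the chart data.
papers = {
    2015: 8, 2016: 11, 2017: 16, 2018: 23, 2019: 34, 2020: 51,
    2021: 76, 2022: 112, 2023: 165, 2024: 220, 2025: 96,
}

# Growth from the first year to the last year in the series.
fold_2025 = papers[2025] / papers[2015]

# Growth from the first year to the peak year, before the 2025 dip.
peak_year = max(papers, key=papers.get)
fold_peak = papers[peak_year] / papers[2015]

# Compound annual growth rate between 2015 and the peak year.
years_elapsed = peak_year - 2015
cagr = (papers[peak_year] / papers[2015]) ** (1 / years_elapsed) - 1

print(f"2015 -> 2025: {fold_2025:.1f}-fold")
print(f"2015 -> {peak_year}: {fold_peak:.1f}-fold")
print(f"CAGR 2015 -> {peak_year}: {cagr:.1%}")
```

On these numbers the 12-fold increase corresponds to the 2025 value, which comes after the dip. Measured to the 2024 peak, the increase is larger.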

Another critical finding is the homogenization risk. When multiple research groups build on the same foundation models (e.g., GPT, Llama), their outputs converge. This creates a feedback loop: AI systems trained on existing literature reinforce dominant paradigms and penalize truly novel ideas. The authors cite evidence that grant proposals flagged as “innovative” by human reviewers are 40% less likely to be funded if they deviate from patterns in the training data (Acemoglu et al., 2026).

Perceived Innovation vs. Funding Success Rate

Proposals rated as highly innovative are less likely to be funded, suggesting a systemic bias against radical ideas.

Proposal category	Funded proposals (%)
Highly novel	18
Moderately novel	32
Incremental	45
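Statements of the form "X% less likely" depend on which group is taken as the baseline. The short computation below works this out from the three percentages in the chart above. It is illustrative arithmetic only: the article does not say which comparison underlies its 40% figure, and the helper name `relative_reduction` is an assumption.

```python
# Funded proposals (%) by perceived novelty, copied from the chart above.
funded_pct = {"Highly novel": 18, "Moderately novel": 32, "Incremental": 45}


def relative_reduction(rate: float, baseline: float) -> float:
    """Fraction by which `rate` falls short of `baseline` (0.25 means 25% lower)."""
    return 1 - rate / baseline


novel = funded_pct["Highly novel"]
vs_incremental = relative_reduction(novel, funded_pct["Incremental"])
vs_moderate = relative_reduction(novel, funded_pct["Moderately novel"])

print(f"Highly novel vs. incremental: {vs_incremental:.1%} lower funding rate")
print(f"Highly novel vs. moderately novel: {vs_moderate:.1%} lower funding rate")
```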

Finally, the paper highlights a paradox: while AI can accelerate discovery, it may simultaneously erode the human capacities that enable scientific revolutions. If researchers outsource hypothesis generation, model criticism, and even interpretation, they risk atrophying the very skills needed for paradigm shifts. The authors call this the “knowledge collapse” dynamic—a slow erosion of deep expertise in favor of superficial productivity.

Why This Changes Things

The arrival of AI scientists isn’t just a technological shift. It’s a philosophical one. For centuries, science has been a human project—a collective attempt to understand nature through observation, reason, and falsification. Now, for the first time, non-human agents are participating in that process not as tools, but as co-reasoners.

This forces us to confront questions once reserved for philosophy:

What is scientific authorship? If an AI generates a hypothesis, tests it, and writes the paper, who deserves credit? The programmer? The principal investigator? The AI itself?
What is peer review for? If a machine can produce 100 plausible manuscripts a day, human reviewers can’t keep up. Do we need AI-powered verification systems to filter submissions?
What is the purpose of science? Is it to produce papers, or to discover truth? If AI can generate convincing but false results, how do we preserve the integrity of the scientific record?

The authors argue that the current scientific ecosystem is ill-equipped to answer these questions. Peer review was designed in the 17th century for a world of slow, deliberate communication. Today, arXiv receives over 1,000 submissions per day. With AI, that number could multiply tenfold.

Consider the case of hallucinated citations. In 2025, a study found that 27% of AI-assisted manuscripts in computational biology contained at least one fabricated reference—citations to non-existent papers that sounded plausible (Zhao et al., 2026). These citations were then reused in subsequent papers, creating a “hallucination cascade” that contaminated the literature.
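One simple guard against such cascades is to check each reference in a manuscript against a trusted bibliographic index before publication. The sketch below assumes every reference carries a persistent identifier and that a trusted set of identifiers is available locally. The `Reference` type, the `find_unverified` helper, and all identifiers are hypothetical and invented for this example. A reference missing from the index is not necessarily fabricated; it is only flagged for human review.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Reference:
    label: str       # how the manuscript cites the work
    identifier: str  # a persistent identifier (hypothetical format here)


def normalize(identifier: str) -> str:
    """Make comparison insensitive to case and surrounding whitespace."""
    return identifier.strip().lower()


def find_unverified(references: Iterable[Reference],
                    trusted_index: Set[str]) -> List[Reference]:
    """Return the references whose identifiers are absent from the index."""
    known = {normalize(i) for i in trusted_index}
    return [r for r in references if normalize(r.identifier) not in known]


# Hypothetical identifiers, invented for this example only.
TRUSTED_INDEX = {"id:0001", "id:0002", "id:0003"}
manuscript_refs = [
    Reference("Known paper A", "ID:0001 "),
    Reference("Known paper B", "id:0003"),
    Reference("Plausible but absent", "id:9999"),
]

flagged = find_unverified(manuscript_refs, TRUSTED_INDEX)
print([r.label for r in flagged])
```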

The stakes go beyond academia. In drug discovery, AI systems are already designing novel molecules. In climate science, they’re optimizing geoengineering models. In physics, they’re probing theories of quantum gravity. If these systems operate without rigorous verification, the consequences could be catastrophic.

But the danger isn’t just error. It’s mediocrity. If AI lowers the barrier to publication, science could become a self-referential loop: models trained on existing papers produce new papers that reinforce the same assumptions, crowding out radical ideas. The authors warn of a “dual-use” risk not just in bioweapons or cyberattacks, but in the intellectual monoculture that AI could enable.

The deeper point is this: science is not just a process. It’s a moral enterprise. It requires judgment, responsibility, and a commitment to truth over novelty. Human scientists bring something irreplaceable: the ability to ask why a question matters, not just how to answer it.

The authors draw on philosopher Charles Taylor’s idea of the “social imaginary”—the shared frameworks that give human activity meaning. When science becomes a competitive marketplace for papers and grants, it loses its moral anchor. AI doesn’t cause this crisis, but it accelerates it. If we don’t act, we risk building a system that produces knowledge faster but understands less.

What’s Next

The authors propose six institutional reforms to govern AI as an epistemic actor, not just a tool:

Mandate machine-readable provenance. Every AI-assisted paper should include an executable pipeline—code, data, and prompts—that allows independent reproduction. This isn’t just transparency; it’s the foundation of verification.
Augment peer review with AI verification. Journals should deploy automated systems to check code execution, data integrity, and citation accuracy—freeing human reviewers to focus on conceptual depth and originality.
Govern dual-use capabilities. AI systems with access to biological design, chemistry, or autonomous experimentation must be subject to pre-deployment evaluation, capability gating, and audit logging, similar to the controls applied in Biosafety Level 3 and 4 (BSL-3/4) laboratories.
Preserve human-in-the-loop decision-making for high-consequence actions. No AI should be allowed to initiate physical experiments, release engineered organisms, or modify its own code without explicit human authorization.
Combat homogenization by incentivizing methodological diversity. Funding agencies should reserve grants for proposals that deliberately challenge dominant paradigms or use underrepresented techniques.
Monitor recursive self-improvement. If AI-generated knowledge becomes training data for new AI systems, we risk a feedback loop where models optimize for plausibility over truth. The authors call for strict provenance tracking of training data to prevent this.
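The first reform, machine-readable provenance, can be illustrated with a content-hash manifest: each artifact behind a paper (code, data, prompts) is fingerprinted so that a later reader can confirm they are reproducing from the same inputs. The sketch below is one possible shape, not a format proposed by the authors. The schema label, the file names, and the function names are assumptions.

```python
import hashlib
import json
from typing import Dict, List


def sha256_hex(content: bytes) -> str:
    """Fingerprint an artifact by the SHA-256 hash of its bytes."""
    return hashlib.sha256(content).hexdigest()


def build_manifest(artifacts: Dict[str, bytes]) -> Dict[str, object]:
    """Record one hash per named artifact (names sorted for stable output)."""
    hashes = {name: sha256_hex(artifacts[name]) for name in sorted(artifacts)}
    return {"schema": "example-provenance/0", "artifacts": hashes}


def verify_manifest(manifest: Dict[str, object],
                    artifacts: Dict[str, bytes]) -> List[str]:
    """Return the names of artifacts that are missing or whose hash changed."""
    problems = []
    for name, expected in manifest["artifacts"].items():
        blob = artifacts.get(name)
        if blob is None or sha256_hex(blob) != expected:
            problems.append(name)
    return problems


# Hypothetical artifacts, inlined as bytes so the example is self-contained.
original = {
    "analysis.py": b"print('fit model')\n",
    "data.csv": b"x,y\n1,2\n2,4\n",
    "prompts.txt": b"Propose a symmetry condition.\n",
}
manifest = build_manifest(original)
print(json.dumps(manifest, indent=2))

# Simulate a silent change to the prompts after publication.
tampered = dict(original, **{"prompts.txt": b"Propose any condition.\n"})
print(verify_manifest(manifest, original))
print(verify_manifest(manifest, tampered))
```

A manifest of this kind only establishes that the inputs are unchanged. Independent reproduction still requires re-running the pipeline and comparing results.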

The ultimate goal is not to slow AI, but to align it with the purpose of science: the disciplined search for truth about nature.

This will require rethinking everything—from how we train scientists to how we measure success. The authors suggest that PhD programs should emphasize conceptual reasoning, ethics, and interdisciplinary synthesis over technical proficiency. Journals should reward depth, not volume. Funding agencies should prioritize projects that aim to create new possibility spaces, not just explore existing ones.

The vision is not of AI replacing scientists, but of a new symbiosis: humans framing the big questions, machines exploring the details. In this future, the most valuable skill won’t be coding or data analysis—it will be judgment. The ability to say: This result matters. This question is worth asking. This path is worth pursuing.

As Alan Turing foresaw in 1951: “It seems probable that once the machine thinking method had started, it would not take long to outstrip our feeble powers.” The challenge isn’t to compete with machines. It’s to define what we want them to do—and to ensure that in the race to discover faster, we don’t forget why we discover at all.

Risk of Knowledge Collapse in AI-Augmented Science

Assessment of key risks in AI-augmented science, rated by an expert panel (n=42). Hallucination and verification ranked highest.

Risk area	Risk level (0–10)
Dual-use	8.5
Hallucination	9.0
Homogenization	7.2
Verification	8.8
Accountability	7.9

Charts

{ "title": "Growth of AI-Assisted Scientific Publications (2015–2025)", "type": "line", "series": [ { "key": "papers", "label": "AI-assisted papers (thousands)" } ], "data": [ { "year": 2015, "papers": 8 }, { "year": 2016, "papers": 11 }, { "year": 2017, "papers": 16 }, { "year": 2018, "papers": 23 }, { "year": 2019, "papers": 34 }, { "year": 2020, "papers": 51 }, { "year": 2021, "papers": 76 }, { "year": 2022, "papers": 112 }, { "year": 2023, "papers": 165 }, { "year": 2024, "papers": 220 }, { "year": 2025, "papers": 96 } ], "description": "Exponential growth in scientific papers using AI tools, based on Elsevier Scopus data. Note dip in 2025 due to retractions and policy changes.", "footer": "Source: Elsevier Scopus, 2025; authors' analysis" }

{ "title": "Perceived Innovation vs. Funding Success Rate", "type": "bar", "series": [ { "key": "funded", "label": "Funded proposals (%)" } ], "data": [ { "category": "Highly novel", "funded": 18 }, { "category": "Moderately novel", "funded": 32 }, { "category": "Incremental", "funded": 45 } ], "description": "Proposals rated as highly innovative are less likely to be funded, suggesting a systemic bias against radical ideas.", "footer": "Source: Acemoglu et al. (2026), NSF proposal database analysis" }

{ "title": "Risk of Knowledge Collapse in AI-Augmented Science", "type": "radar", "series": [ { "key": "risk", "label": "Risk level (0–10)" } ], "data": [ { "area": "Dual-use", "risk": 8.5 }, { "area": "Hallucination", "risk": 9.0 }, { "area": "Homogenization", "risk": 7.2 }, { "area": "Verification", "risk": 8.8 }, { "area": "Accountability", "risk": 7.9 } ], "description": "Assessment of key risks in AI-augmented science, rated by expert panel (n=42). Hallucination and verification ranked highest.", "footer": "Source: Authors' expert survey, 2026" }

Figures

Figure 1: Architecture of the Denario multi-agent system. Four specialized agents (generation, critic, verification, controller) interact through a central reasoning layer, enabling autonomous hypothesis testing and refinement.

Figure 2: AI-generated symmetry condition in modified gravity theory, compared to known solutions. The system converged on a previously undocumented but mathematically consistent structure.

Figure 3: Conceptual model of the ‘adjacent possible’ in scientific discovery. AI excels at exploring latent possibilities; humans create new possibility spaces through conceptual reframing.