
The AI Optimizer That Was Secretly Darwinian All Along

A philosopher-mathematician at Yale just proved that gradient descent and natural selection are the same equation — and used it to fix one of AI's most popular

SGD, Natural Gradient, and Newton's method are already perfect simulations of Darwinian evolution.

Somewhere inside the billions of training steps that taught a large language model to write poetry or a neural network to detect cancer, there is — quietly, faithfully — Darwinian evolution running. Not as a metaphor. Not as loose inspiration. As the actual mathematics.

That is the central claim of a remarkable paper by Daniel Grimmer, a philosopher and mathematician at Yale, published on arXiv in May 2026. Grimmer proves that several of the most important optimization algorithms in modern machine learning are already — without any modification — scientifically valid simulations of Darwinian evolutionary dynamics. And for the one major algorithm that isn't quite there, he performs what he calls "minor but principled mathematical surgery" to fix it.

The implications run in two directions simultaneously. Machine learning researchers gain a deep biological grounding for tools they already use. Evolutionary biologists gain something they have long lacked: a rigorous, in silico laboratory for running controlled experiments on the fundamental principles of Darwinian adaptation.

The Science

To understand what Grimmer (2026) has done, it helps to understand what he's working against. The field of evolutionary computation has always had a dual mandate — to build good optimization algorithms and to faithfully simulate Darwinian evolution — but these two goals have rarely been pursued together. For decades, the engineering side dominated: researchers borrowed biological vocabulary (populations, fitness, selection) but replaced the actual mathematics of evolution with heuristics borrowed from physics, or simply dressed up existing algorithms in biological clothing. Critics have called this the "metaphor crisis," a proliferation of supposedly novel algorithms that are actually recycled methods under zoological names (Sörensen, 2015; Campelo and Aranha, 2023).

Grimmer's approach, called Darwinian Lineage Simulations (DLS), does the opposite: it starts with evolutionary biology's foundational equations and derives optimization algorithms from them, rather than retrofitting biology onto existing tools.

The starting point is a 100-year-old argument. Ronald Fisher and Sewall Wright — two founding giants of population genetics — held deeply opposing views about how evolution actually works. Fisher (1930) saw evolution as a deterministic process: a large, well-mixed population climbing a fitness landscape driven by selection acting on genetic variance. Wright (1931, 1932) disagreed. He argued that small, isolated sub-populations undergoing random genetic drift — the statistical noise that comes from sampling a finite population — were essential for evolution to escape local fitness peaks and explore new solutions. The two men argued bitterly for decades.

Grimmer proves they were both right, and that their theories are formally equivalent — at least for asexual reproduction. The key insight is that Fisher's deterministically-evolving total population can always be decomposed into Wright's randomly-drifting sub-populations, and those sub-populations can always be reassembled to recover Fisher's dynamics exactly. The magic that makes this work is what Grimmer calls the DLS noise relation: a precise mathematical constraint on what genetic drift must look like for the decomposition to be evolutionarily faithful.

Figure 2: An illustration of our ability to track the total population’s evolution by independently evolving many sub-populations. a) The total population (blue) can be represented as a “super-distribution” of sub-populations (orange). b) Evolving each of these sub-populations (orange) independently and then reassembling them (weighted by their average fitness) exactly recovers the evolved total population (blue). Source: Daniel Grimmer
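The reassembly step in Figure 2 can be checked numerically for selection alone. The following is a minimal sketch rather than the paper's code: the five-genotype grid, the fitness values, and the two sub-populations are made-up numbers, and mutation is left out for brevity.

```python
# Illustrative assumption: five discrete genotypes with these fitness values.
fitness = [1.0, 1.5, 2.0, 1.2, 0.5]


def normalize(weights):
    """Rescale non-negative weights so they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def select(dist):
    """One round of fitness-proportional (asexual) selection.

    Returns the evolved distribution and the mean fitness before selection.
    """
    mean_fitness = sum(p * f for p, f in zip(dist, fitness))
    evolved = [p * f / mean_fitness for p, f in zip(dist, fitness)]
    return evolved, mean_fitness


# Two hypothetical sub-populations and their shares of the total population.
subpops = [normalize([4, 3, 2, 1, 0]), normalize([0, 1, 2, 3, 4])]
shares = [0.3, 0.7]

# The total population is the share-weighted mixture of the sub-populations.
total = [sum(s * q[i] for s, q in zip(shares, subpops)) for i in range(len(fitness))]

# Route 1: evolve the total population directly.
total_evolved, _ = select(total)

# Route 2: evolve each sub-population on its own, then reassemble them with
# weights proportional to (old share) x (sub-population mean fitness).
evolved = [select(q) for q in subpops]
new_shares = normalize([s * mean_f for s, (_, mean_f) in zip(shares, evolved)])
reassembled = [
    sum(w * q[i] for w, (q, _) in zip(new_shares, evolved))
    for i in range(len(fitness))
]

max_gap = max(abs(a - b) for a, b in zip(total_evolved, reassembled))
print(max_gap)
```

The two routes agree up to floating-point rounding because selection acts linearly on the unnormalized population, which is why the fitness weighting in the reassembly step is needed.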

The DLS noise relation links three quantities. Let $\mu$ be the mutation rate, $V_g$ be the population's genotype variance at generation $g$, and $W_g$ be the covariance of genetic drift. The relation states, at leading order:

This is not a modeling choice — it is a constraint that must be satisfied if the simulation is to faithfully represent evolutionary dynamics. The genetic drift in any valid simulation must absorb exactly the difference between what mutations add to the variance and what selection removes. Crucially, as long as this relation is satisfied, the researcher is free to choose any bookkeeping arrangement they like for dividing the total population into sub-populations. That freedom, it turns out, is enormous — and it is precisely what allows DLS to encompass such a wide range of optimization algorithms.
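The bookkeeping described above can be sketched in one dimension. This is an assumed reading of the prose description, not the paper's formula: it treats the relation as the law of total variance, with selection changing a Gaussian population's variance from V to V + hV² at leading order (h is the curvature of log-fitness), mutation adding μ², and drift receiving whatever the tracked lineage does not keep. The function name and all numbers are hypothetical.

```python
def drift_variance(v_now, v_next, h, mu2):
    """Drift variance implied by variance bookkeeping (assumed 1D form).

    v_now  : genotype variance of the tracked lineage this generation
    v_next : variance assigned to the tracked lineage next generation
    h      : second derivative of log-fitness (negative near a fitness peak)
    mu2    : variance added per generation by mutation
    """
    after_selection = v_now + h * v_now ** 2  # leading order in h * v_now
    after_mutation = after_selection + mu2
    # Whatever variance the lineage does not keep must show up as drift.
    return after_mutation - v_next


# Holding the lineage's variance fixed: drift absorbs what mutation adds
# minus what selection removes.
w_fixed = drift_variance(v_now=0.01, v_next=0.01, h=-1.0, mu2=2e-4)

# Asking the lineage to keep more variance than is available gives a negative
# value, which cannot be a variance, so this bookkeeping is not admissible.
w_invalid = drift_variance(v_now=0.01, v_next=0.02, h=-1.0, mu2=2e-4)

print(w_fixed, w_invalid)
```

Under this reading, the freedom to choose a bookkeeping arrangement is the freedom to choose `v_next`, subject to the implied drift variance staying non-negative.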

What They Found

The paper's most striking result is what falls out when you examine well-known optimization algorithms through the DLS lens.

Stochastic Gradient Descent (SGD), the workhorse of machine learning, updates model parameters by following the gradient of a noisy estimate of the loss function. Its update rule already fits the DLS framework: once its noise is chosen to be evolutionarily faithful genetic drift (i.e., noise that satisfies the DLS noise relation), SGD becomes a scientifically valid in silico experiment on Darwinian evolution.

Natural Gradient Descent, a more sophisticated algorithm that accounts for the geometry of the parameter space by pre-conditioning the gradient with the inverse Fisher information matrix, is also already DLS-compliant. This is not a coincidence: as Grimmer notes, the pre-conditioner in Lande's (1976) equation of quantitative genetics, the population's genetic covariance matrix, has a natural information-geometric interpretation as exactly this kind of natural gradient (Otwinowski et al., 2020). Evolution was doing natural gradient ascent all along.

The Damped Newton's Method — which uses second-order curvature information to take more efficient steps — also fits the DLS framework. In biological terms, the population's variance naturally acts as an anisotropic pre-conditioner: it is the spread of genotypes in the population that determines how far and in which directions selection can move the mean genotype in one generation. Fisher's "key insight," as Grimmer frames it, is that the current genetic variance is the learning rate.

Figure 3: The trajectory of a Darwinian lineage is plotted in a 2D genotype space $(\phi_A, \phi_B)$ without down-sampling. Each shaded disk represents the lineage’s genotype distribution, $p_g(\phi)$, at a specific generation. The point at the center of each disk is the mean genotype, $\phi_g$, while the radius represents its standard deviation, $\sigma_g$. This figure uses the isotropic update rule from Theorem 1. The arrows indicate the gradient of the log-fitness function, $\nabla\log(\mathcal{F}_\lambda)(\phi_g)$, at each generation. Notice that even as these gradients remain the same size, the rate of genotype change, $\Delta\phi_g = \sigma_g^2\,\nabla\log(\mathcal{F}_\lambda)(\phi_g)$, increases. This is because (in alignment with Fisher’s key insight) the lineage’s current amount of variation, $\sigma_g^2$, acts as its learning rate. This acceleration will continue at least until $\sigma_g$ becomes comparable to the curvature scale of $\log(\mathcal{F}_\lambda)$. The need to control the variance of the lineage that we track motivates the down-sampling procedures introduced in Sec. 2.4. Source: Daniel Grimmer
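The acceleration described in the Figure 3 caption can be reproduced with a few lines. This is a minimal sketch under stated assumptions: the log-fitness is taken to be linear (so its gradient is constant and selection leaves the variance unchanged), mutation adds a fixed amount of variance each generation, and there is no down-sampling. The numbers and the function name are illustrative.

```python
import math


def grad_log_fitness(phi):
    """Gradient of an assumed linear log-fitness, log F = phi_A + phi_B."""
    return (1.0, 1.0)


mu2 = 0.01      # variance added by mutation each generation (illustrative)
sigma2 = 0.01   # initial isotropic genotype variance (illustrative)
phi = (0.0, 0.0)
step_lengths = []

for generation in range(5):
    g = grad_log_fitness(phi)
    # Isotropic update: delta phi = sigma^2 * grad log F.
    step = (sigma2 * g[0], sigma2 * g[1])
    phi = (phi[0] + step[0], phi[1] + step[1])
    step_lengths.append(math.hypot(step[0], step[1]))
    # Linear log-fitness has zero curvature, so selection does not shrink the
    # variance; mutation keeps adding to it because nothing is down-sampled.
    sigma2 += mu2

print(step_lengths)
```

The gradient never changes, yet each step is longer than the last, because the variance that multiplies the gradient keeps growing.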

The notable outlier is Adam, currently the most widely used optimizer in deep learning. Adam combines gradient descent with two adaptive mechanisms: a momentum term that accumulates a running average of past gradients, and an RMSProp term that rescales steps based on the magnitude of recent gradients. Together, these make Adam extraordinarily effective in practice. But Adam's momentum term, as Grimmer shows, violates the DLS noise relation: it introduces correlations across generations that have no counterpart in evolutionary dynamics.
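For reference, the two mechanisms can be seen in a plain implementation of the standard Adam update. This is a textbook sketch of ordinary Adam (written as a minimizer), not code from the paper; the quadratic test function is an illustrative assumption.

```python
import math


def adam_minimize(grad, x0, steps, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Standard Adam: momentum (m) plus RMSProp-style rescaling (v)."""
    x = list(x0)
    m = [0.0] * len(x)  # running average of past gradients (momentum)
    v = [0.0] * len(x)  # running average of squared gradients (RMSProp)
    for t in range(1, steps + 1):
        g = grad(x)
        for i in range(len(x)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
            m_hat = m[i] / (1.0 - beta1 ** t)  # bias correction
            v_hat = v[i] / (1.0 - beta2 ** t)
            x[i] -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return x


def quadratic_grad(x):
    """Gradient of the illustrative loss f(x, y) = x^2 + 10 y^2."""
    return [2.0 * x[0], 20.0 * x[1]]


one_step = adam_minimize(quadratic_grad, [1.0, 1.0], steps=1)
print(one_step)
```

On the first step both coordinates move by roughly `alpha`, even though the second gradient component is ten times larger: that is the RMSProp rescaling at work. The memory carried in `m` from one step to the next is the part the paper identifies as having no evolutionary counterpart.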

The fix — "Adam-DLS" — involves replacing Adam's additive momentum with a rank-1 extension of the population's variance in the direction of accumulated historical gradient. Geometrically, this elongates the tracked lineage's genotype distribution along the direction it has been moving, much as a population spreading through a fitness valley will stretch in the direction of travel. The update still achieves an Adam-like momentum effect, but now via a biologically meaningful mechanism: the shape of the population's variance, rather than an explicit memory term tacked on from outside.
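The geometric idea can be illustrated with a small sketch. The exact Adam-DLS update is not reproduced here; the form below, a diagonal matrix D plus a rank-1 term along u = D m, is an assumption based on the paper's description, and the coefficient `strength` and all numbers are hypothetical.

```python
def rank_one_variance(d_diag, momentum, strength):
    """Assumed form: V = D + strength * u u^T, with u = D m and D diagonal."""
    u = [d * m for d, m in zip(d_diag, momentum)]
    n = len(d_diag)
    variance = [
        [(d_diag[i] if i == j else 0.0) + strength * u[i] * u[j] for j in range(n)]
        for i in range(n)
    ]
    return variance, u


def quad_form(matrix, vec):
    """Variance of the population along direction vec: vec^T matrix vec."""
    n = len(vec)
    return sum(vec[i] * matrix[i][j] * vec[j] for i in range(n) for j in range(n))


d_diag = [0.02, 0.01]   # RMSProp-like diagonal pre-conditioner (illustrative)
momentum = [3.0, 4.0]   # accumulated historical gradient (illustrative)
variance, u = rank_one_variance(d_diag, momentum, strength=5.0)

# The step is still "variance times gradient": no separate memory term is
# added to the update itself.
gradient = [1.0, 0.5]
step = [sum(variance[i][j] * gradient[j] for j in range(2)) for i in range(2)]

spread_with_extension = quad_form(variance, u)
spread_without_extension = sum(d * x * x for d, x in zip(d_diag, u))
print(step, spread_with_extension, spread_without_extension)
```

The rank-1 term enlarges the population's spread along the pre-conditioned momentum direction, so the lineage moves further that way while the history enters only through the shape of the variance.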

Figure 5: The Rosenbrock benchmark (with $a=2$, $b=100$) is attempted by three algorithms: Noisy Gradient Ascent, Noisy Adam, and Adam-DLS. While noisy gradient ascent is evolutionarily compliant (it fits the DLS framework) it cannot solve this benchmark. With an isotropic variance, evolutionary fidelity, $\sigma_g^2 H_g \ll I$, bounds the allowed step size by the high curvature across the ridge. As a result, the lineage experiences almost no selection pressure and hence follows a mutation-dominated random walk. Concretely, at $(-2, 4)$ the spectral norm of the Hessian is $\|H\|_2 = 3402$, forcing $\sigma_g^2 = 0.01/\|H\|_2 = 3\times 10^{-6} \ll \mu^2 = 10^{-4}$. A more sophisticated optimizer (such as Adam) is necessary to solve the benchmark. While Adam is not a faithful simulation of evolution, Adam-DLS is (see Sec. 3.2). Evolutionary compliance, $V_g H_g \ll I$, permits an anisotropic variance which is small across the ridge but large along the ridge. In Adam-DLS, this elongation has two sources: a RMSProp-like diagonal preconditioner, $D_g$, and a rank-1 extension in the (pre-conditioned) direction of historical momentum, $D_g m_g$. The latter term has the exact same orientation as Adam’s additive momentum and yields a similar effect. Importantly, Adam-DLS (unlike Adam) is a faithful simulation of evolution. Moreover it is the first strictly evolutionarily faithful gradient-based model to pass the Rosenbrock benchmark. (Both Adam and Adam-DLS take $\alpha = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, and $\mu^2 = 10^{-4}$. These hyper-parameters maintain $\text{Tr}(V_g H_g) \lesssim 0.01$ and $W_g \succeq 0$ without ad hoc mutation spikes. A histogram of the $d_g$ scalar is unimodal with a mean of $0.98$ and standard deviation of $6.70$.) An interactive reproduction of this benchmark is available at https://github.com/danielgrimmer/adam-dls. Source: Daniel Grimmer

Grimmer tests Adam-DLS on the Rosenbrock benchmark, a famously difficult optimization problem whose optimum sits in a narrow, curved, banana-shaped valley, and it passes. Plain noisy gradient ascent is evolutionarily compliant but cannot solve this benchmark: its variance is isotropic, so the fidelity condition caps the step size in every direction at the tiny value allowed by the high curvature across the ridge. Adam-DLS, with its anisotropic variance (small across the ridge, large along it), navigates the ridge successfully. It is, the paper notes, "the first strictly evolutionarily faithful gradient-based model to pass the Rosenbrock benchmark."
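The curvature figures quoted in the Figure 5 caption can be reproduced directly. The sketch below assumes the standard Rosenbrock form f(x, y) = (a − x)² + b(y − x²)² and takes the 0.01 fidelity budget from the caption; the function names are illustrative.

```python
import math


def rosenbrock_hessian(x, y, b=100.0):
    """Second derivatives of f(x, y) = (a - x)^2 + b (y - x^2)^2.

    The Hessian does not depend on a.
    """
    f_xx = 2.0 - 4.0 * b * (y - x * x) + 8.0 * b * x * x
    f_xy = -4.0 * b * x
    f_yy = 2.0 * b
    return f_xx, f_xy, f_yy


def spectral_norm_sym2(f_xx, f_xy, f_yy):
    """Largest absolute eigenvalue of a symmetric 2x2 matrix (closed form)."""
    mean = 0.5 * (f_xx + f_yy)
    radius = math.hypot(0.5 * (f_xx - f_yy), f_xy)
    return max(abs(mean + radius), abs(mean - radius))


hessian = rosenbrock_hessian(-2.0, 4.0)
norm = spectral_norm_sym2(*hessian)

fidelity_budget = 0.01            # from the caption: sigma^2 = 0.01 / ||H||
sigma2_cap = fidelity_budget / norm
mu2 = 1e-4                        # mutation variance quoted in the caption

print(norm, sigma2_cap, sigma2_cap / mu2)
```

The norm comes out at about 3402 and the isotropic variance cap at about 3×10⁻⁶, more than thirty times smaller than the mutation variance, which is why the isotropic lineage's motion is dominated by mutation rather than selection.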

Algorithms and Their Evolutionary Compliance

Whether major optimization algorithms satisfy the DLS (Darwinian Lineage Simulation) evolutionary fidelity constraints, as determined by Grimmer (2026). Compliance score: 1 = fully compliant, 0 = non-compliant without modification.

| Algorithm | Compliance score |
| --- | --- |
| SGD | 1 |
| Natural Gradient Descent | 1 |
| Damped Newton's Method | 1 |
| Adam (standard) | 0 |
| Adam-DLS (surgically repaired) | 1 |

Adam-DLS Rosenbrock Benchmark: Key Hyperparameters

Hyperparameter settings used in the Rosenbrock benchmark comparison between Adam and Adam-DLS (Grimmer, 2026, Fig. 5). Both algorithms use identical settings ($\alpha = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $\mu^2 = 10^{-4}$); Adam-DLS additionally satisfies evolutionary compliance constraints.

| Property | Value |
| --- | --- |
| Solves Rosenbrock | 1 |
| Evolutionary Compliance | 0 |
| Anisotropic Variance | 1 |
| Momentum Term | 1 |
| RMSProp Scaling | 1 |
| DLS Noise Relation | 0 |

Why This Changes Things

The significance here operates at several levels, and it's worth separating them.

For machine learning researchers, the DLS framework provides something that has been lacking: a principled, biologically-grounded account of why certain optimizers work. The fact that natural gradient descent corresponds to Lande's equation from quantitative genetics is not decorative. It means that the geometrical intuitions behind information-geometric optimization have a direct evolutionary interpretation. The population's variance as a learning rate is not an analogy — it is the same mathematical object, playing the same mathematical role, in both systems.

For evolutionary biologists, the implications may be even more significant. One of the persistent criticisms of computational evolution has been that the best-performing algorithms — the ones you'd actually want to use to study evolution — have abandoned biological fidelity for engineering convenience. Grimmer's work dissolves that trade-off. Because SGD, Natural Gradient Descent, and Newton's method already satisfy the DLS constraints (once given evolutionarily faithful drift), researchers can now use these high-performance algorithms as scientifically valid experimental platforms, knowing that what they observe corresponds to real evolutionary dynamics.

This matters for questions that are genuinely hard to study in living organisms. How does the rate of genetic drift interact with the shape of a fitness landscape to determine the rate of adaptation? What happens when a population encounters a completely flat region — a fitness plateau — with no gradient to follow? Grimmer shows (with a maze-like fitness function in Figure 1) that Fisher's deterministic mass selection can actually solve such mazes without genetic drift, through mutation pressure acting as a deterministic diffusion process. This is a concrete, testable prediction that the DLS framework makes possible.

Figure 1: A maze-like fitness function, $\mathcal{F}(\phi)$, plotted in a 2D genotype space $(\phi_A, \phi_B)$. The dark regions represent zero-fitness boundaries enclosing flat corridors of constant fitness. There is a single exit at the top leading to an unobstructed region of higher fitness. Suppose that the initial population is localized in a remote corner of the maze. One might wonder whether random genetic drift is required to solve this maze. A Wright-like Darwinian Lineage Simulation (see Theorem 2) will eventually solve this maze via the stochastic wandering of genetic drift. Indeed, almost every lineage will eventually wander its way out of this maze (some faster than others). Perhaps surprisingly, Fisher’s model of deterministic mass selection, Eq. (2.1), will also eventually solve this maze without genetic drift, see Sec. 2.1. It does so via a combination of strong selection at the walls/dead-ends and blind mutation pressure (which acts as a deterministic diffusion process expanding the population’s variance along neutral corridors). Indeed, because of the asexual Fisher-Wright equivalence the deterministic Fisherian dynamics will solve this maze exactly as fast as the fastest Darwinian Lineage Simulation. So was the first individual to break out of this maze (and their many, many children) aided by genetic drift? From a lineage perspective, yes. From a population perspective, no. Source: Daniel Grimmer
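The phrase "deterministic diffusion" in the Figure 1 caption can be made concrete with a one-dimensional neutral corridor. This is a toy sketch, not the paper's maze: the 41-cell grid, the (1/4, 1/2, 1/4) mutation kernel, and the starting cell are illustrative assumptions, and the walls and exit are left out.

```python
def mutate(dist):
    """Spread each genotype's share to its neighbours with weights 1/4, 1/2, 1/4."""
    n = len(dist)
    out = [0.0] * n
    for i, p in enumerate(dist):
        out[i] += 0.5 * p
        if i > 0:
            out[i - 1] += 0.25 * p
        if i < n - 1:
            out[i + 1] += 0.25 * p
    return out


def mean_and_variance(dist):
    """Mean and variance of genotype position under the distribution."""
    mean = sum(i * p for i, p in enumerate(dist))
    variance = sum((i - mean) ** 2 * p for i, p in enumerate(dist))
    return mean, variance


# The whole population starts at a single genotype in the middle of the corridor.
population = [0.0] * 41
population[20] = 1.0

# Flat fitness: selection changes nothing, so each generation is mutation only.
for generation in range(10):
    population = mutate(population)

mean, variance = mean_and_variance(population)
print(mean, variance)
```

No random numbers are drawn, yet the population's variance grows by 0.5 every generation while its mean stays put: the distribution as a whole spreads along the corridor deterministically, even though any single lineage inside it would appear to wander.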

There is also a deeper conceptual realignment underway here. The persistent association of evolutionary computation with gradient-free optimization — the assumption that you only reach for evolutionary methods when you can't compute gradients — turns out to be a historical accident, not a mathematical truth. Grimmer is blunt about this: "The strict association of evolution with gradient-free methods is therefore an artifact of a particular computational tradition, not a reflection of evolution's own mathematical structure." Fitness gradients are what selection automatically computes. The population's variance is the learning rate. Evolution has always been doing gradient ascent; we just weren't looking at it that way.

This also resolves a long-standing tension in the field between the "metaphor crisis" and the genuine scientific mandate of evolutionary computation. Previous bridges between gradient descent and evolutionary dynamics, notably the work of Kucharavy et al. (2023) on Gillespie-Orr Evolutionary Algorithms and Frank's (2025) Force-Metric-Bias framework, either remained limited to basic SGD or provided a taxonomic classification of algorithms without being able to say whether any given algorithm was truly evolutionarily faithful. The DLS framework is both generative and diagnostic: it tells you not just that an algorithm resembles evolution, but whether it is evolution, and if not, exactly what to change.

What's Next

The paper is explicit about its primary contribution being interpretive and foundational rather than algorithmic. Grimmer is not claiming to invent new optimizers. What is new is the revelation — and the proof — that the mathematical skeleton of Darwinian evolution has been quietly present inside tools that the machine learning community built for entirely different reasons.

Several open directions follow naturally. The DLS framework is currently derived for asexual reproduction; extending it to sexual reproduction would require handling genetic recombination, which breaks the clean decomposition into independent sub-populations. That is a substantially harder problem, but it would open up the framework to modeling a far wider range of biological systems.

There is also the question of what the DLS noise relation implies for practical optimizer design. The constraint links the amount of noise in each step to the change in the population's variance. This is not just a biological nicety: it may have practical consequences for optimizer stability and generalization, since it couples exploration (genetic drift) to the optimizer's current state in a principled way. Whether Adam-DLS outperforms vanilla Adam on real deep learning benchmarks remains to be tested, and Grimmer has made an interactive implementation available on GitHub (https://github.com/danielgrimmer/adam-dls) for exactly this purpose.

Perhaps most intriguingly, the DLS framework raises the question of what other algorithms, not yet examined, might already be evolutionarily faithful — and what undiscovered optimizers might fall out of the evolutionary equations if you look in the right places. The paper demonstrates that the fitness landscape of mathematical optimization and the fitness landscape of biological evolution are not just analogous. In a precise, provable sense, they are the same landscape. We have been climbing it all along — we just didn't know whose footsteps we were following.

