The Optimizer That Teaches AI to Walk the Tightrope — Not Just Balance on Training Wheels
A new optimizer called DoPr improves robot task success and language model quality without touching the data, architecture, or objective — just by changing how
Imagine you are training a robot arm to sort objects on a table. You feed it thousands of expert demonstrations, minimize the prediction error until the numbers look excellent, and deploy. Within a few moves, the robot drifts. Each small action error shifts the state of the world slightly away from anything it saw during training. By step ten, it is navigating territory its training loss never touched. The task fails — despite a validation loss your dashboard would call a success.
This is the central problem that Zhang, Shah, Zhang, Zhang, Matni, and Simchowitz confront in their paper introducing Double Preconditioning (DoPr) (Zhang et al., 2026). Their core claim is quietly radical: the choice of optimizer, rather than the data, the architecture, or the loss function, can directly determine whether a model survives contact with reality. And today's most popular optimizers, including the celebrated Adam and the newly prominent Muon, are designed and benchmarked against validation loss, which the paper argues is the wrong target once a model's outputs feed back into its own inputs.
The Science
The researchers define a concept they call test-time feedback (TTF): the systematic mismatch between the distribution a model is trained on and the distribution it encounters when deployed. This isn't the ordinary machine learning worry about overfitting or domain shift from a different dataset. TTF shift is self-inflicted. When a language model generates token by token, each output becomes the input for the next prediction. When a robot acts in the world, each action changes the state the robot must respond to next. The model's own imperfections reshape its own future inputs, compounding errors in a feedback loop that grows with task length.
TTF is ubiquitous in contemporary AI. Autoregressive language models live entirely inside it. So do flow-based generative models, diffusion policies, and any robot trained on behavioral cloning — imitating expert demonstrations step by step. In all these cases, models are trained with per-step supervised losses ($L^2$ regression, cross-entropy), but deployed in a regime where the distribution of inputs is determined by the model's own sequential choices.
The authors formalize TTF rigorously as a problem in behavior cloning within a Markov Decision Process (MDP) — a mathematical framework for sequential decision-making where an agent takes actions, transitions to new states, and accumulates reward. The key insight is that the training loss is evaluated under the distribution of states visited by the expert demonstrator, while the test-time reward is evaluated under the distribution of states the learned policy visits. These two distributions diverge as soon as the learned policy makes even small mistakes, and the divergence compounds across time.
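The gap between the two state distributions can be made concrete with a toy rollout. The sketch below is not taken from the paper: it assumes a hypothetical scalar system `x_{t+1} = a*x_t - gain*x_t`, and the gains, horizon, and function names are chosen purely for illustration. It evaluates the same per-step imitation error once on the states the expert visits and once on the states the learner visits.

```python
def rollout(gain, a=1.5, x0=1.0, horizon=50):
    """States visited by the closed loop x_{t+1} = a*x_t - gain*x_t."""
    states = [x0]
    for _ in range(horizon - 1):
        x = states[-1]
        states.append(a * x - gain * x)
    return states


def imitation_loss(learner_gain, expert_gain, states):
    """Mean squared action error between learner and expert on the given states."""
    errors = [((learner_gain - expert_gain) * x) ** 2 for x in states]
    return sum(errors) / len(errors)


# Illustrative values only: the learner's gain is off by 0.1.
EXPERT_GAIN = 0.55   # closed-loop factor 0.95, so states shrink
LEARNER_GAIN = 0.45  # closed-loop factor 1.05, so states slowly grow

expert_states = rollout(EXPERT_GAIN)
learner_states = rollout(LEARNER_GAIN)

# Same policy error, weighted by two different state distributions.
train_view = imitation_loss(LEARNER_GAIN, EXPERT_GAIN, expert_states)
deploy_view = imitation_loss(LEARNER_GAIN, EXPERT_GAIN, learner_states)

print(f"loss on expert states:  {train_view:.5f}")
print(f"loss on learner states: {deploy_view:.5f}")
```

The learner's mistake is identical in both measurements. Only the states it is measured on differ, and the states the learner drives itself into make the same mistake far more costly.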
The team is based at the University of Pennsylvania and Carnegie Mellon University, with co-advising from Amazon FAR. Their approach is purely an optimizer intervention — no new data collection, no architectural changes, no modified training objectives. They evaluate across continuous-control locomotion (Humanoid-v5 from MuJoCo), dexterous robot manipulation (Tool Hang and Transport from RoboSuite), and language generation tasks.
What They Found
The foundation of DoPr rests on a theoretical analysis of what goes wrong with standard optimizers under TTF. The key observation: popular optimizers like Adam and Muon are gradient-wise preconditioners (GP) — they reshape gradient updates using statistics about the gradient itself. What they do not account for is the statistics of the activations — the internal representations that each layer computes from its inputs.
When the input distribution to a layer is non-isotropic (meaning some input directions carry far more variance than others — a common condition in practice), the network learns features unevenly. It becomes excellent at predicting along high-variance directions and poor along low-variance ones. Under ordinary i.i.d. evaluation, this might not matter much. But under TTF, where the model's own errors gradually shift the input distribution, those neglected low-variance directions become exactly the directions the model encounters during deployment. The errors in those directions get amplified sequentially, driving TTF shift.
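A minimal sketch of this uneven learning, assuming a single linear layer trained by plain gradient descent on a squared loss (a simplification for illustration, not the paper's experimental setup). The covariance values, target map, and step count are hypothetical. With a 100x variance gap between the two input directions, the error along the low-variance direction barely moves while the high-variance direction converges.

```python
import numpy as np

# Input covariance: direction 0 carries 100x the variance of direction 1.
sigma = np.diag([1.0, 0.01])
w_star = np.array([[1.0, 1.0]])   # target linear map (1 output, 2 inputs)
w = np.zeros_like(w_star)

lr = 0.5
for _ in range(100):
    # Population gradient of 0.5 * E[(w x - w_star x)^2] with E[x x^T] = sigma.
    grad = (w - w_star) @ sigma
    w = w - lr * grad

# Remaining error per input direction after training.
err = np.abs(w - w_star).ravel()
print("error along high-variance direction:", err[0])
print("error along low-variance direction: ", err[1])
```

Each direction's error shrinks by a factor of `1 - lr * variance` per step, so the low-variance direction learns roughly 100 times more slowly here.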
Critically, the authors prove (Proposition 3.2 in the paper) that these feature learning deficits at earlier layers cannot be fixed by updates at later layers. The damage is irreversible in the forward pass.
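Their solution is activation-wise preconditioning (AP). Before passing a gradient update to the standard optimizer, DoPr first multiplies it by the inverse of the input covariance matrix $\hat{\Sigma}_\ell^{-1}$, where $\hat{\Sigma}_\ell$ is the empirical covariance of activations at that layer. Formally, with $G_\ell$ denoting the gradient with respect to the weight matrix $W_\ell$ of layer $\ell$ (rows indexing outputs, columns indexing inputs):

$$\tilde{G}_\ell = G_\ell \, \hat{\Sigma}_\ell^{-1}$$

This corrected gradient is then passed to any standard gradient-based optimizer (Adam, Muon, etc.) for the actual parameter update, where $\mathrm{Opt}(\cdot)$ denotes the base optimizer's update rule and $\eta$ its learning rate:

$$W_\ell \leftarrow W_\ell - \eta \, \mathrm{Opt}(\tilde{G}_\ell)$$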
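The two-stage update described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the function names `ap_precondition` and `sgd_step` are hypothetical, the covariance is an uncentered second moment estimated from the current batch, a small `damping` term is added for numerical stability, and plain SGD stands in for Adam or Muon as the base optimizer.

```python
import numpy as np


def ap_precondition(grad, activations, damping=1e-3):
    """Right-multiply a layer gradient by the inverse activation covariance.

    grad:        (d_out, d_in) gradient of the loss w.r.t. the layer weight
    activations: (batch, d_in) inputs to that layer
    damping:     ridge term for numerical stability (an assumption of this sketch)
    """
    batch = activations.shape[0]
    cov = activations.T @ activations / batch          # uncentered second moment
    cov = cov + damping * np.eye(cov.shape[0])
    # cov is symmetric, so grad @ inv(cov) equals solve(cov, grad.T).T;
    # solving avoids forming the inverse explicitly.
    return np.linalg.solve(cov, grad.T).T


def sgd_step(weight, grad, lr=0.5):
    """Stand-in base optimizer; Adam or Muon would slot in here instead."""
    return weight - lr * grad


# Toy regression with anisotropic inputs (variances 1 and 0.01).
rng = np.random.default_rng(0)
x = rng.normal(size=(512, 2)) * np.array([1.0, 0.1])
w_star = np.array([[1.0, 1.0]])
y = x @ w_star.T

w = np.zeros((1, 2))
for _ in range(20):
    residual = x @ w.T - y                   # (batch, 1)
    grad = residual.T @ x / x.shape[0]       # gradient of half the mean squared error
    w = sgd_step(w, ap_precondition(grad, x))

max_err = np.abs(w - w_star).max()
print("largest remaining error:", max_err)
```

Because the preconditioner cancels the input covariance in the gradient, both input directions converge at nearly the same rate in this toy problem, despite the 100x variance gap.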
The result is a modular, drop-in modification. You do not replace your optimizer with DoPr; you insert AP as a pre-processing step before your existing optimizer. The computational overhead is modest relative to the training pass itself.
The invariance guarantee is clean and provable: DoPr updates are invariant to affine transformations of the input distribution (Proposition 4.2). If you rescale or rotate the inputs to a layer, a standard optimizer's trajectory changes; DoPr's does not.
This invariance directly addresses the feature learning deficit — the optimizer becomes equally attentive to all input directions, regardless of their variance in the training data.
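The invariance can be checked numerically in a simplified setting. The sketch below assumes a single linear layer, population (noise-free) gradients, and an invertible linear change of input coordinates `x' = A x`; it does not cover the offset part of an affine map, and it is not the paper's proof. The helper name and the matrices are hypothetical. It trains on raw and on transformed inputs, then compares the function each run has learned on the raw inputs.

```python
import numpy as np


def effective_map_after_training(A, use_ap, steps=5, lr=0.1):
    """Train w on inputs x' = A x and return the map it induces on raw x."""
    sigma = np.diag([1.0, 0.01])          # covariance of the raw inputs x
    w_star = np.array([[1.0, -2.0]])      # target map on raw inputs
    sigma_in = A @ sigma @ A.T            # covariance of the transformed inputs
    w = np.zeros((1, 2))
    for _ in range(steps):
        # Population gradient of 0.5 * E[(w x' - w_star x)^2] w.r.t. w.
        grad = (w @ A - w_star) @ sigma @ A.T
        if use_ap:
            grad = grad @ np.linalg.inv(sigma_in)
        w = w - lr * grad
    return w @ A


identity = np.eye(2)
mix = np.array([[2.0, 1.0], [0.0, 0.5]])  # arbitrary invertible rescale and shear

gd_gap = np.abs(
    effective_map_after_training(identity, use_ap=False)
    - effective_map_after_training(mix, use_ap=False)
).max()
ap_gap = np.abs(
    effective_map_after_training(identity, use_ap=True)
    - effective_map_after_training(mix, use_ap=True)
).max()

print("plain gradient descent, change in learned function:", gd_gap)
print("with activation preconditioning, change:           ", ap_gap)
```

Plain gradient descent lands on a visibly different function after the change of coordinates, while the preconditioned run agrees with itself up to floating-point error.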
Across experiments, the results are consistent and striking.
Humanoid locomotion. On the Humanoid-v5 MuJoCo benchmark — a notoriously challenging task where a simulated humanoid must learn to walk and run stably — DoPr variants of AdamW, Muon, Signum, and AdaMuon all achieved higher terminal reward than their baseline counterparts. Crucially, these gains did not consistently accompany lower training or validation loss (Zhang et al., 2026). This is the TTF gap made visible: the model that logs the better loss number is not the model that actually walks.
Robot manipulation. On Tool Hang and Transport, two dexterous manipulation tasks that require multi-step precision, trained here on proficient-human (PH) demonstrations, DoPr variants outperformed their baselines on best success rate across all three random seeds tested. The improvement was consistent whether the base optimizer was AdamW or Muon.
Humanoid-v5: Terminal Reward by Optimizer
Terminal locomotion reward for Humanoid-v5 comparing standard optimizers vs. DoPr variants. DoPr variants consistently achieve higher terminal reward without consistent improvements in validation loss.
| Optimizer | Terminal reward |
|---|---|
| AdamW | 3,200 |
| DoPr-AdamW | 5,100 |
| Muon | 3,600 |
| DoPr-Muon | 5,400 |
| Signum | 3,000 |
| DoPr-Signum | 4,900 |
| AdaMuon | 3,400 |
| DoPr-AdaMuon | 5,200 |
Robot Manipulation: Best Success Rate
Median best success rate across 3 random seeds on Tool Hang (PH) and Transport (PH) benchmarks, comparing standard vs. DoPr optimizer variants.
| Task | Optimizer | Median best success rate (%) |
|---|---|---|
| Tool Hang | AdamW | 28 |
| Tool Hang | DoPr-AdamW | 52 |
| Tool Hang | Muon | 32 |
| Tool Hang | DoPr-Muon | 58 |
| Transport | AdamW | 35 |
| Transport | DoPr-AdamW | 60 |
| Transport | Muon | 38 |
| Transport | DoPr-Muon | 65 |
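The theoretical analysis using linear dynamical systems (LDS) provides an exact closed-form characterization of why this happens. Write $\Delta$ for the gap between the learned and expert policies, $\Sigma_{\text{demo}}$ for the demonstrator's state covariance, and $\Sigma_{\text{learner}}$ for the learner's. In the LDS setting, validation loss is proportional to $\operatorname{tr}(\Delta \Sigma_{\text{demo}} \Delta^\top)$, the error weighted by the demonstrator's state covariance, while test-time reward is proportional to $\operatorname{tr}(\Delta \Sigma_{\text{learner}} \Delta^\top)$, the same error weighted by the learner's state covariance. These two objectives weight the same policy error in fundamentally different directions. Minimizing one does not guarantee minimizing the other.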
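A small numerical sketch of the LDS argument follows. It is a hedged illustration, not the paper's derivation: the system matrices, noise levels, and policy errors are hypothetical, the weighted errors are written as traces of the policy error against a state covariance, and the stationary covariance is found by simple fixed-point iteration. Two learners are built to have the same error under the demonstrator's covariance, one erring along a direction the expert visits often and one along a direction it rarely visits.

```python
import numpy as np


def stationary_cov(a_cl, noise_cov, iters=2000):
    """Fixed point of S = a_cl S a_cl^T + noise_cov (stable closed loops only)."""
    s = np.zeros_like(noise_cov)
    for _ in range(iters):
        s = a_cl @ s @ a_cl.T + noise_cov
    return s


# Hypothetical system x_{t+1} = A x_t + B u_t + noise, with policy u_t = K x_t.
A = 1.1 * np.eye(2)
B = np.eye(2)
NOISE = np.diag([1.0, 0.01])        # direction 1 is rarely excited
K_EXPERT = -0.6 * np.eye(2)


def weighted_errors(delta):
    """Policy error delta weighted by expert and by learner state covariance."""
    sigma_expert = stationary_cov(A + B @ K_EXPERT, NOISE)
    sigma_learner = stationary_cov(A + B @ (K_EXPERT + delta), NOISE)
    val = np.trace(delta @ sigma_expert @ delta.T)
    deploy = np.trace(delta @ sigma_learner @ delta.T)
    return val, deploy


err_in_common_dir = np.diag([0.045, 0.0])   # small error where the expert has data
err_in_rare_dir = np.diag([0.0, 0.45])      # larger error where the expert has little

val_common, deploy_common = weighted_errors(err_in_common_dir)
val_rare, deploy_rare = weighted_errors(err_in_rare_dir)

print("validation-style error:", val_common, val_rare)
print("deployment-style error:", deploy_common, deploy_rare)
```

Both learners look identical when scored under the demonstrator's covariance, yet the one that errs in the rarely visited direction does several times worse once the covariance is the one its own closed loop produces.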
Why This Changes Things
The implications ripple outward in several directions simultaneously.
For the optimizer design community, DoPr reopens a question that has been quietly assumed closed. The dominant paradigm — judging optimizers by their "time to target validation loss" on benchmarks like NanoGPT pretraining or AlgoPerf — may be selecting for the wrong property. The authors note that Shampoo won the AlgoPerf optimizer competition on "holdout error per unit compute," and Muon came to prominence on NanoGPT pretraining speed (Zhang et al., 2026). Both are measured by validation loss. If validation loss and downstream task performance systematically diverge under TTF — as the paper argues and demonstrates — then the optimizer leaderboard may be ranking solutions to the wrong problem.
For robotics, the implications are immediate. Behavioral cloning is the dominant paradigm for teaching robots from human demonstrations, and it lives entirely within the TTF regime. The compounding error problem is well-known (Ross and Bagnell analyzed it formally in 2010; Ross et al., 2011 then introduced the DAgger algorithm, which addresses it by querying the expert on the states the learner itself visits). DoPr offers a complementary path: fix the optimizer, not the data collection pipeline. This matters practically because acquiring new demonstrations is expensive, slow, and sometimes impossible.
For language models, the connection is subtler but real. Autoregressive generation is by definition a TTF process — every token the model generates becomes part of the context for the next. Long-horizon coherence, factual consistency across a document, multi-step reasoning chains — all of these are places where TTF shift can degrade performance even when per-token cross-entropy looks fine. DoPr's language generation experiments, while preliminary, suggest that the gap between perplexity and generation quality may be partially addressable through optimizer choice.
For flow-based generative models (a third domain the paper identifies as TTF-afflicted), the mechanism is analogous. Flow models generate samples by iteratively transforming noise through a learned sequence of steps; each step's output feeds the next. Feature learning quality at intermediate layers propagates through the chain.
There is also a deeper conceptual contribution here. The paper formalizes an "ideal" loss: what you would minimize if you could evaluate the training loss under the states the learned policy actually visits, rather than the demonstrator's states. This ideal loss is the correct objective for TTF settings, but it is circular (it depends on the policy you are trying to learn) and requires data you typically do not have. DoPr's activation-wise preconditioning can be understood as implicitly optimizing towards this ideal loss without ever computing it, by making the optimizer equally sensitive to all state directions rather than only the high-variance ones the demonstrator happened to visit.
Importantly, the paper also shows that DoPr's hyperparameters can be reliably predicted using existing scaling heuristics from the $\mu$P (maximal update parametrization) framework (Yang et al., 2023) — meaning practitioners do not need a new expensive hyperparameter search. The drop-in nature of AP, combined with hyperparameter transfer, lowers the barrier to adoption substantially.
What's Next
The paper opens as many questions as it closes.
The most immediate is measurement. If validation loss is not a reliable proxy for downstream performance in TTF settings, what should replace it? The authors explicitly flag this as an open question. The deep learning community has invested enormously in infrastructure for tracking validation loss — dashboards, early stopping heuristics, benchmark suites. Replacing or supplementing that metric with task-specific rollout evaluation is expensive and domain-specific by nature. The field may need a new class of standardized TTF benchmarks analogous to what NanoGPT provides for pure language modeling.
There are also open questions about scale. The robotics and locomotion experiments here are conducted at relatively modest model sizes. Whether AP remains computationally tractable and empirically beneficial at the scale of production language models — where the activation covariance matrices are enormous and the training runs cost millions of dollars — is unresolved. The authors note that AP is related to KFAC (Kronecker-Factored Approximate Curvature), a method with known scalability challenges, and that their formulation uses efficient approximations. But the scaling story for DoPr at frontier LLM scale remains to be written.
The theoretical analysis also currently relies on linear dynamical systems as a "minimal example." The real-world TTF settings the paper targets — transformers, diffusion policies, convolutional robot controllers — are highly nonlinear. The intuitions transfer, and the empirics support them, but a rigorous nonlinear theory of DoPr's benefits under TTF would strengthen the foundation.
Finally, the paper gestures at but does not fully resolve the relationship between DoPr and DAgger-style interactive imitation learning. DAgger corrects the data distribution; DoPr corrects the optimizer. The two interventions are theoretically complementary — one addresses what states you train on, the other addresses how evenly you learn across those states. Whether combining them yields compounding benefits, or whether there is a regime where one dominates, is an open empirical question.
What the paper establishes clearly is a new design axis for a problem the community has mostly addressed through data and architecture. The optimizer has been the background assumption — the thing you pick from a standard menu (Adam, AdamW, SGD) and largely forget about. DoPr makes the case that the optimizer encodes implicit assumptions about what the model should learn, and that for AI systems that operate sequentially in the world, those assumptions have been quietly, consistently wrong. Fixing them does not require more data or a bigger model. It requires preconditioning twice.