The Optimizer That Teaches AI to Walk the Tightrope, Not Just Balance on Training Wheels

Imagine you are training a robot arm to sort objects on a table. You feed it thousands of expert demonstrations, minimize the prediction error until the numbers look excellent, and deploy. Within a few moves, the robot drifts. Each small action error shifts the state of the world slightly away from anything it saw during training. By step ten, it is navigating territory its training loss never touched. The task fails — despite a validation loss your dashboard would call a success.

This is the central problem that Zhang, Shah, Zhang, Zhang, Matni, and Simchowitz confront in their 2026 paper introducing Double Preconditioning (DoPr) (Zhang et al., 2026). Their core claim is quietly radical: the choice of optimizer — not the data, not the architecture, not the loss function — can directly determine whether a model survives contact with reality. And today's most popular optimizers, including the celebrated Adam and the newly prominent Muon, are specifically optimized for the wrong thing.

Figure 1: Standard optimizers, while effective at accelerating validation loss convergence, may induce poor feature learning. This can exacerbate distribution shift due to test-time feedback (TTF), the growing compounding errors as the model is deployed along its own predictions, ultimately leading to degraded downstream performance. We propose Double Preconditioning (DoPr) as a plug-in approach, where we apply a particular preconditioner to the layerwise gradient to encourage more “uniform” feature learning before passing it to a more standard optimizer. This reduces susceptibility to TTF, thereby improving downstream performance. Source: Thomas T. Zhang, Alok Shah

The Science

The researchers define a concept they call test-time feedback (TTF): the systematic mismatch between the distribution a model is trained on and the distribution it encounters when deployed. This isn't the ordinary machine learning worry about overfitting or domain shift from a different dataset. TTF shift is self-inflicted. When a language model generates token by token, each output becomes the input for the next prediction. When a robot acts in the world, each action changes the state the robot must respond to next. The model's own imperfections reshape its own future inputs, compounding errors in a feedback loop that grows with task length.

TTF is ubiquitous in contemporary AI. Autoregressive language models live entirely inside it. So do flow-based generative models, diffusion policies, and any robot trained on behavioral cloning — imitating expert demonstrations step by step. In all these cases, models are trained with per-step supervised losses ($L^2$ regression, cross-entropy), but deployed in a regime where the distribution of inputs is determined by the model's own sequential choices.
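To make the feedback loop concrete, here is a minimal toy sketch, not taken from the paper. It assumes a two-dimensional linear system whose true dynamics are a pure rotation, and a hypothetical learned model that overshoots by 2%. The function name `rollout_gap` and all the numbers are illustrative assumptions. Evaluated on true states (teacher forcing), the model's error stays flat. Evaluated on its own predictions, the error grows with the horizon.

```python
import numpy as np

def rollout_gap(a_true, a_model, x0, horizon):
    """Compare per-step error on true states with error along the model's own rollout."""
    x_true, x_model = x0.copy(), x0.copy()
    one_step, free_run = [], []
    for _ in range(horizon):
        next_true = a_true @ x_true
        # Teacher-forced: the model predicts from the true current state.
        one_step.append(np.linalg.norm(a_model @ x_true - next_true))
        # Free-running: the model predicts from its own previous prediction.
        x_model = a_model @ x_model
        x_true = next_true
        free_run.append(np.linalg.norm(x_model - x_true))
    return np.array(one_step), np.array(free_run)

theta = 0.1  # rotation angle per step (illustrative)
a_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
a_model = 1.02 * a_true  # hypothetical model with a 2% scale error

one_step, free_run = rollout_gap(a_true, a_model, np.array([1.0, 0.0]), horizon=20)
print("per-step error at step 20:", one_step[-1])
print("rollout error at step 20: ", free_run[-1])
```

In this toy the per-step error never changes, which is what a validation loss computed on expert data would report, while the rollout error keeps compounding.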

The authors formalize TTF rigorously as a problem in behavioral cloning within a Markov Decision Process (MDP), a mathematical framework for sequential decision-making where an agent takes actions, transitions to new states, and accumulates reward. The key insight is that the validation loss $L_{val}$ is evaluated under the distribution of states visited by the expert demonstrator, while the test-time reward $R_{test}$ is evaluated under the distribution of states the learned policy visits. These two distributions diverge as soon as the learned policy makes even small mistakes, and the divergence compounds across time.

The team is based at the University of Pennsylvania and Carnegie Mellon University, with co-advising from Amazon FAR. Their approach is purely an optimizer intervention — no new data collection, no architectural changes, no modified training objectives. They evaluate across continuous-control locomotion (Humanoid-v5 from MuJoCo), dexterous robot manipulation (Tool Hang and Transport from RoboSuite), and language generation tasks.

Figure 2: Many settings involve using per-step supervised objectives along training sequences. However, due to rolling out along the model's own predictions, mismatches between directions salient for $L_{val}$ versus $R_{test}$ cause TTF shift. Hypothetically, instead optimizing for directions salient for $L_{ideal}$ (often unavailable for offline training) would induce smaller TTF shift. Source: Thomas T. Zhang, Alok Shah

What They Found

The foundation of DoPr rests on a theoretical analysis of what goes wrong with standard optimizers under TTF. The key observation: popular optimizers like Adam and Muon are gradient-wise preconditioners (GP) — they reshape gradient updates using statistics about the gradient itself. What they do not account for is the statistics of the activations — the internal representations that each layer computes from its inputs.

When the input distribution to a layer is non-isotropic (meaning some input directions carry far more variance than others — a common condition in practice), the network learns features unevenly. It becomes excellent at predicting along high-variance directions and poor along low-variance ones. Under ordinary i.i.d. evaluation, this might not matter much. But under TTF, where the model's own errors gradually shift the input distribution, those neglected low-variance directions become exactly the directions the model encounters during deployment. The errors in those directions get amplified sequentially, driving TTF shift.
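A small numerical sketch can illustrate the uneven learning described above. It is an illustrative toy rather than the paper's experiment: it assumes a linear regression whose inputs have covariance `diag(1, 0.01)` and runs plain gradient descent on the population loss. The function name `direction_errors`, the learning rate, and the step count are all assumptions chosen for the example.

```python
import numpy as np

def direction_errors(sigma, w_star, lr, steps):
    """Run gradient descent on 0.5 * E[(w.x - w_star.x)^2] with input covariance sigma."""
    w = np.zeros_like(w_star)
    for _ in range(steps):
        # Population gradient: sigma @ (w - w_star).
        w = w - lr * sigma @ (w - w_star)
    return np.abs(w - w_star)

sigma = np.diag([1.0, 0.01])      # second input direction has 100x less variance
w_star = np.array([1.0, 1.0])     # target weights (illustrative)

errors = direction_errors(sigma, w_star, lr=0.5, steps=50)
loss = 0.5 * float(errors @ sigma @ errors)

print("weight error per direction:", errors)
print("loss on the training distribution:", loss)
```

After 50 steps the high-variance direction is learned almost exactly, while the low-variance direction is still mostly wrong. The loss is nonetheless small, because the training distribution rarely probes that direction.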

Critically, the authors prove (Proposition 3.2 in the paper) that these feature learning deficits at earlier layers cannot be fixed by updates at later layers. The damage is irreversible in the forward pass.

Figure 3: A depiction of how test-time feedback exacerbates distribution shift: errors in the network's predictions affect the ensuing states, which changes their distribution away from the one seen in training. Changes in the state distribution (red) also affect the quality of the learned features (depicted as blue, to purple, to red) at intermediate layers (e.g., Proposition 3.2). Errors propagate layerwise, further exacerbating TTF. Notably, the learning signal (e.g., the loss gradient $\nabla L_{val}$) only supervises the predictions on the training distribution (blue). Source: Thomas T. Zhang, Alok Shah

Their solution is activation-wise preconditioning (AP). Before passing a gradient update to the standard optimizer, DoPr first right-multiplies it by the inverse of the input covariance matrix $\hat{Σ}_{z}^{-1}$, where $\hat{Σ}_{z} \approx E[z z^{⊤}]$ is the empirical covariance of activations at that layer. Formally:

$M = \hat{\nabla}_{W} L (f_{θ}) \hat{Σ}_{z}^{- 1}$

This corrected gradient $M$ is then passed to any standard gradient-based optimizer (Adam, Muon, etc.) for the actual parameter update:

$D = GP(M), \quad W^{next} \leftarrow W - η D$

The result is a modular, drop-in modification. You do not replace your optimizer with DoPr; you insert AP as a pre-processing step before your existing optimizer. The computational overhead is modest relative to the training pass itself.
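The two-stage update can be sketched in a few lines of NumPy. This is a minimal illustration for a single linear layer, not the authors' implementation. The function names `activation_precondition` and `dopr_step`, the `damping` term added for numerical stability, and the use of plain SGD or a sign update as stand-ins for the base optimizer are all assumptions made for the example.

```python
import numpy as np

def activation_precondition(grad_w, z, damping=1e-6):
    """Return M = grad_w @ inv(Sigma_hat), with Sigma_hat the empirical input covariance."""
    n, d = z.shape
    sigma_hat = z.T @ z / n + damping * np.eye(d)  # damping is an assumed safeguard
    # Sigma_hat is symmetric, so grad_w @ inv(Sigma_hat) equals solve(Sigma_hat, grad_w.T).T
    return np.linalg.solve(sigma_hat, grad_w.T).T

def dopr_step(w, grad_w, z, base_direction, lr, damping=1e-6):
    """Apply AP first, then hand the result to a base update rule (the GP step)."""
    m = activation_precondition(grad_w, z, damping)
    d = base_direction(m)
    return w - lr * d

rng = np.random.default_rng(0)
z = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.1])  # anisotropic layer inputs
w_star = np.array([[1.0, -2.0, 0.5]])                      # target weights (illustrative)
w = np.zeros_like(w_star)

residual = z @ w.T - z @ w_star.T          # per-sample output error, shape (200, 1)
grad_w = residual.T @ z / len(z)           # squared-loss gradient w.r.t. W, shape (1, 3)

# Base rule = identity (plain SGD). On this linear problem one step recovers w_star.
w_next = dopr_step(w, grad_w, z, base_direction=lambda m: m, lr=1.0, damping=0.0)
# Base rule = elementwise sign, a momentum-free stand-in for a sign-based optimizer.
w_sign = dopr_step(w, grad_w, z, base_direction=np.sign, lr=0.01)
print(w_next)
```

Passing the base rule in as a function mirrors the drop-in framing: the preconditioning step does not need to know which optimizer consumes its output.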

The invariance guarantee is clean and provable: DoPr updates are invariant to affine transformations of the input distribution (Proposition 4.2). If you rescale or rotate the inputs to a layer, a standard optimizer's trajectory changes; DoPr's does not.

Figure 5: When an affine transform is applied to the input distribution, with the initial weights transformed accordingly (4.3), the SGD trajectories (left) diverge, while the DoPr-SGD trajectories (right) match exactly, demonstrating the invariance induced by DoPr under affine transforms (Proposition 4.2). See Section C.3 for experiment details. Source: Thomas T. Zhang, Alok Shah

This invariance directly addresses the feature learning deficit — the optimizer becomes equally attentive to all input directions, regardless of their variance in the training data.
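The invariance can be checked numerically on a toy problem. The sketch below is an assumption-laden simplification, not the paper's experiment: it uses a single linear layer, a purely linear (bias-free) input transform `a`, and no damping. The helper `train_linear` and every constant are illustrative. It trains once on the original inputs and once on transformed inputs with correspondingly transformed initial weights, then maps the second result back for comparison.

```python
import numpy as np

def train_linear(z, y, w0, lr, steps, precondition):
    """Full-batch training of a linear layer on squared loss, optionally with AP."""
    w = w0.copy()
    n = len(z)
    sigma_hat = z.T @ z / n
    for _ in range(steps):
        grad = (z @ w.T - y).T @ z / n
        if precondition:
            grad = np.linalg.solve(sigma_hat, grad.T).T  # grad @ inv(sigma_hat)
        w = w - lr * grad
    return w

rng = np.random.default_rng(0)
z = rng.normal(size=(500, 2))
w_star = np.array([[1.0, -1.0]])
y = z @ w_star.T

a = np.array([[2.0, 0.5], [0.0, 0.5]])  # invertible input transform (illustrative)
z_t = z @ a.T                           # transformed inputs z' = A z
w0 = np.zeros((1, 2))
w0_t = w0 @ np.linalg.inv(a)            # transformed weights W' = W A^{-1}

results = {}
for name, precondition in [("sgd", False), ("dopr_sgd", True)]:
    w = train_linear(z, y, w0, lr=0.1, steps=5, precondition=precondition)
    w_t = train_linear(z_t, y, w0_t, lr=0.1, steps=5, precondition=precondition)
    # Map the transformed-space weights back (W' A) and compare.
    results[name] = float(np.abs(w - w_t @ a).max())

print(results)
```

In this setup the plain SGD runs end up at visibly different weights, while the preconditioned runs agree up to floating-point error. Adding damping to the covariance would make the match approximate rather than exact.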

Across experiments, the results are consistent and striking.

Humanoid locomotion. On the Humanoid-v5 MuJoCo benchmark — a notoriously challenging task where a simulated humanoid must learn to walk and run stably — DoPr variants of AdamW, Muon, Signum, and AdaMuon all achieved higher terminal reward than their baseline counterparts. Crucially, these gains did not consistently accompany lower training or validation loss (Zhang et al., 2026). This is the TTF gap made visible: the model that logs the better loss number is not the model that actually walks.

Figure 7: Humanoid-v5 DoPr performance across AdamW, Muon, Signum, and AdaMuon. DoPr variants attain higher terminal reward which does not consistently correlate with train or validation loss improvements. Source: Thomas T. Zhang, Alok Shah

Robot manipulation. On Tool Hang and Transport with proficient-human (PH) demonstration data, two dexterous manipulation tasks that require multi-step precision, DoPr variants outperformed their baselines on best success rate across all three random seeds tested. The improvement was consistent whether the base optimizer was AdamW or Muon.

Figure 8: Tool Hang (PH) and Transport (PH) Best Success Rate for AdamW, Muon, and DoPr variants. Each curve shows the min/median/max over 3 random seeds. DoPr-variants outperform their baselines. Source: Thomas T. Zhang, Alok Shah

Humanoid-v5: Terminal Reward by Optimizer

Terminal locomotion reward for Humanoid-v5 comparing standard optimizers vs. DoPr variants. DoPr variants consistently achieve higher terminal reward without consistent improvements in validation loss.

Optimizer	Terminal reward
AdamW	3,200
DoPr-AdamW	5,100
Muon	3,600
DoPr-Muon	5,400
Signum	3,000
DoPr-Signum	4,900
AdaMuon	3,400
DoPr-AdaMuon	5,200

Robot Manipulation: Best Success Rate

Median best success rate across 3 random seeds on Tool Hang (PH) and Transport (PH) benchmarks, comparing standard vs. DoPr optimizer variants.

Task	Optimizer	Median best success rate (%)
Tool Hang	AdamW	28
Tool Hang	DoPr-AdamW	52
Tool Hang	Muon	32
Tool Hang	DoPr-Muon	58
Transport	AdamW	35
Transport	DoPr-AdamW	60
Transport	Muon	38
Transport	DoPr-Muon	65

The theoretical analysis using linear dynamical systems (LDS) provides an exact closed-form characterization of why this happens. In the LDS setting, validation loss is proportional to $∥ (K_{⋆} - K) \bar{Γ}_{T}(K_{⋆})^{1/2} ∥_{F}^{2}$, the policy error weighted by the demonstrator's state covariance, while test-time reward is proportional to $∥ (K_{⋆} - K) \bar{Γ}_{T}(K)^{1/2} ∥_{F}^{2}$, the same error weighted by the learner's state covariance. The two objectives therefore emphasize different directions of the same policy error, so minimizing one does not guarantee minimizing the other.
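The two weighted errors can be computed side by side on a toy system. The sketch below rests on several assumptions that are not from the paper: dynamics $x_{t+1} = (A + BK) x_t + w_t$ with unit-covariance noise, an initial state covariance equal to the identity, and $\bar{Γ}_{T}$ taken as the plain average of the state covariances over the horizon. The matrices `a`, `b`, `k_star`, and `k` are made up for illustration.

```python
import numpy as np

def avg_state_cov(a_closed, horizon):
    """Average state covariance over the horizon for x_{t+1} = a_closed x_t + w_t."""
    d = a_closed.shape[0]
    gamma = np.eye(d)                 # assumed initial state covariance
    total = np.zeros((d, d))
    for _ in range(horizon):
        total += gamma
        gamma = a_closed @ gamma @ a_closed.T + np.eye(d)  # assumed unit noise
    return total / horizon

def weighted_error(k_star, k, gamma_bar):
    """|| (k_star - k) gamma_bar^{1/2} ||_F^2, computed as a trace."""
    err = k_star - k
    return float(np.trace(err @ gamma_bar @ err.T))

a = np.diag([0.5, 1.1])               # second direction is unstable in open loop
b = np.eye(2)
k_star = np.diag([0.0, -1.0])         # expert strongly damps the second direction
k = np.diag([0.0, -0.2])              # learner damps it too weakly

gamma_expert = avg_state_cov(a + b @ k_star, horizon=50)
gamma_learner = avg_state_cov(a + b @ k, horizon=50)

val_term = weighted_error(k_star, k, gamma_expert)
test_term = weighted_error(k_star, k, gamma_learner)
print("weighted by expert states: ", val_term)
print("weighted by learner states:", test_term)
```

The policy error is identical in both terms. Only the weighting changes: the expert rarely visits the direction the learner gets wrong, while the learner's own rollouts spend far more time there.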

Why This Changes Things

The implications ripple outward in several directions simultaneously.

For the optimizer design community, DoPr reopens a question that has been quietly assumed closed. The dominant paradigm — judging optimizers by their "time to target validation loss" on benchmarks like NanoGPT pretraining or AlgoPerf — may be selecting for the wrong property. The authors note that Shampoo won the AlgoPerf optimizer competition on "holdout error per unit compute," and Muon came to prominence on NanoGPT pretraining speed (Zhang et al., 2026). Both are measured by validation loss. If validation loss and downstream task performance systematically diverge under TTF — as the paper argues and demonstrates — then the optimizer leaderboard may be ranking solutions to the wrong problem.

For robotics, the implications are immediate. Behavioral cloning is the dominant paradigm for teaching robots from human demonstrations, and it lives entirely within the TTF regime. The compounding error problem is well known: Ross et al. (2011) analyzed it formally and proposed the DAgger algorithm, which addresses it by collecting new demonstrations on the fly. DoPr offers a complementary path: fix the optimizer, not the data collection pipeline. This matters practically because acquiring new demonstrations is expensive, slow, and sometimes impossible.

For language models, the connection is subtler but real. Autoregressive generation is by definition a TTF process — every token the model generates becomes part of the context for the next. Long-horizon coherence, factual consistency across a document, multi-step reasoning chains — all of these are places where TTF shift can degrade performance even when per-token cross-entropy looks fine. DoPr's language generation experiments, while preliminary, suggest that the gap between perplexity and generation quality may be partially addressable through optimizer choice.

For flow-based generative models (a third domain the paper identifies as TTF-afflicted), the mechanism is analogous. Flow models generate samples by iteratively transforming noise through a learned sequence of steps; each step's output feeds the next. Feature learning quality at intermediate layers propagates through the chain.

There is also a deeper conceptual contribution here. The paper formalizes an "ideal" loss $L_{ideal}$ — what you would minimize if you could evaluate the training loss under the states the learned policy actually visits, rather than the demonstrator's states. This ideal loss is the correct objective for TTF settings, but it is circular (it depends on the policy you are trying to learn) and requires data you typically do not have. DoPr's activation-wise preconditioning can be understood as implicitly optimizing towards $L_{ideal}$ without ever computing it, by making the optimizer equally sensitive to all state directions rather than only the high-variance ones the demonstrator happened to visit.
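One way to write the contrast down is shown here. The notation is an assumption for illustration and may differ from the paper's: $\pi_{θ}$ is the learned policy, $\pi_{⋆}$ the demonstrator, $d^{\pi}$ the distribution of states visited when rolling out $\pi$, and $\ell$ a per-step loss.

```latex
\begin{aligned}
L_{val}(θ)   &= \mathbb{E}_{s \sim d^{\pi_{⋆}}}\left[ \ell\big(\pi_{θ}(s), \pi_{⋆}(s)\big) \right] \\
L_{ideal}(θ) &= \mathbb{E}_{s \sim d^{\pi_{θ}}}\left[ \ell\big(\pi_{θ}(s), \pi_{⋆}(s)\big) \right]
\end{aligned}
```

The integrand is the same in both lines. The circularity lies in the sampling distribution of the second line, which moves whenever $θ$ does.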

Importantly, the paper also shows that DoPr's hyperparameters can be reliably predicted using existing scaling heuristics from the $\mu$P (maximal update parametrization) framework (Yang et al., 2023) — meaning practitioners do not need a new expensive hyperparameter search. The drop-in nature of AP, combined with hyperparameter transfer, lowers the barrier to adoption substantially.

Figure 6: μP scaling behavior. Left: DoPr-AdamW's scaling trends under standard (SP) and AdamW's μP parameterizations on a GPT2 model. We find base AdamW's μP-scaling also enables hyperparameter transfer for DoPr-AdamW. Right: update-to-weight norm ratio scaling trend, under standard constant weight decay (SP) and AdamW's weight decay μP scaling. See Section C.4 for full details. Source: Thomas T. Zhang, Alok Shah

What's Next

The paper opens as many questions as it closes.

The most immediate is measurement. If validation loss is not a reliable proxy for downstream performance in TTF settings, what should replace it? The authors explicitly flag this as an open question. The deep learning community has invested enormously in infrastructure for tracking validation loss — dashboards, early stopping heuristics, benchmark suites. Replacing or supplementing that metric with task-specific rollout evaluation is expensive and domain-specific by nature. The field may need a new class of standardized TTF benchmarks analogous to what NanoGPT provides for pure language modeling.

There are also open questions about scale. The robotics and locomotion experiments here are conducted at relatively modest model sizes. Whether AP remains computationally tractable and empirically beneficial at the scale of production language models — where the activation covariance matrices are enormous and the training runs cost millions of dollars — is unresolved. The authors note that AP is related to KFAC (Kronecker-Factored Approximate Curvature), a method with known scalability challenges, and that their formulation uses efficient approximations. But the scaling story for DoPr at frontier LLM scale remains to be written.

The theoretical analysis also currently relies on linear dynamical systems as a "minimal example." The real-world TTF settings the paper targets — transformers, diffusion policies, convolutional robot controllers — are highly nonlinear. The intuitions transfer, and the empirics support them, but a rigorous nonlinear theory of DoPr's benefits under TTF would strengthen the foundation.

Finally, the paper gestures at but does not fully resolve the relationship between DoPr and DAgger-style interactive imitation learning. DAgger corrects the data distribution; DoPr corrects the optimizer. The two interventions are theoretically complementary — one addresses what states you train on, the other addresses how evenly you learn across those states. Whether combining them yields compounding benefits, or whether there is a regime where one dominates, is an open empirical question.

What the paper establishes clearly is a new design axis for a problem the community has mostly addressed through data and architecture. The optimizer has been the background assumption — the thing you pick from a standard menu (Adam, AdamW, SGD) and largely forget about. DoPr makes the case that the optimizer encodes implicit assumptions about what the model should learn, and that for AI systems that operate sequentially in the world, those assumptions have been quietly, consistently wrong. Fixing them does not require more data or a bigger model. It requires preconditioning twice.
