AI That Doesn’t Drift: Stable Predictions for

A new kind of neural predictor stays on track for 1,000 steps — and uses 97% less compute than standard attention

When a machine learning model tries to predict the future of a complex system — from the rhythm of gene expression in a living cell to the swing of a chaotic pendulum — it often starts strong but quickly veers off course. Tiny errors compound. Phase slips. Amplitudes decay. Trajectories spiral into nonsense. This isn’t just a theoretical flaw; it’s a practical dealbreaker for using AI in real-world control, forecasting, or safety-critical systems.

Now, a team of researchers from the University of Edinburgh and the University of Washington has built a predictor that doesn’t drift. In tests across three benchmark dynamical systems, including a synthetic gene circuit and a chaotic oscillator, their model maintains accurate, stable predictions for up to 1,000 time steps, reducing long-horizon error by up to 15 times compared to standard deep learning approaches. Its memory block adds only about 30,000 parameters, less than half the cost of conventional attention mechanisms, and runs in linear time, making it fast enough for real-time use.

The secret? Two simple, elegant ideas: an attention-free memory that corrects small errors before they grow, and a self-correcting trigger that snaps the model back onto the right path when it starts to wander. Together, they form a new kind of hybrid AI that combines the interpretability of physics-inspired models with the flexibility of deep learning — and could unlock reliable prediction in biology, climate science, and engineering.

The Science

At the heart of this work is the Koopman operator, a mathematical tool from dynamical systems theory that transforms nonlinear dynamics into linear ones — but only if you can find the right coordinate system. Think of it like flattening a crumpled piece of paper: the surface is still the same, but now you can draw straight lines across it. The Koopman operator does this for time evolution, turning complex, curving trajectories into straight-line predictions in a high-dimensional latent space.
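To make the "right coordinate system" idea concrete, here is a minimal sketch on a textbook-style toy map. The map, the constants `a`, `b`, `c`, and the function names are illustrative assumptions and are not taken from the paper. The point is only that a nonlinear update in the original coordinates can become an exactly linear one once an extra observable (here $x_1^2$) is added.

```python
import numpy as np

# Toy nonlinear map (illustrative, not one of the paper's benchmarks):
#   x1' = a * x1
#   x2' = b * x2 + c * x1**2
a, b, c = 0.9, 0.5, 0.3

def step(x):
    """One step of the nonlinear map in the original coordinates."""
    x1, x2 = x
    return np.array([a * x1, b * x2 + c * x1**2])

def lift(x):
    """Lift the state to the observables (x1, x2, x1^2)."""
    x1, x2 = x
    return np.array([x1, x2, x1**2])

# In the lifted coordinates the same dynamics are exactly linear.
K = np.array([
    [a,   0.0, 0.0],
    [0.0, b,   c],
    [0.0, 0.0, a**2],
])

x = np.array([1.0, -0.5])
z = lift(x)
for _ in range(50):
    x = step(x)   # nonlinear rollout in the original space
    z = K @ z     # linear rollout in the lifted space

# Largest disagreement between the two rollouts after 50 steps.
max_gap = float(np.max(np.abs(lift(x) - z)))
```

For this hand-picked map the lifted coordinates are known in closed form. A Koopman autoencoder has to learn an approximate version of them from data, which is where drift can creep in.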

Neural Koopman autoencoders (KAEs) learn this transformation from data: an encoder maps observations (like gene concentrations or pendulum angles) into a latent space where evolution is linear, governed by a learned matrix $K$, and a decoder reconstructs the original state. Training involves minimizing three things: reconstruction error (can it recover what it saw?), linearity error (does the latent trajectory follow $K^{t} z_{0}$?), and prediction error (does it forecast correctly?).
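The three training terms can be sketched in a few lines. This is a rough illustration only: the encoder and decoder below are hypothetical stand-ins (random linear maps with a tanh), `K` is a placeholder matrix, and the equal weighting of the three terms is an assumption rather than the paper's choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, d_latent, horizon = 3, 8, 20

# Hypothetical stand-ins for trained networks.
W_enc = rng.normal(scale=0.3, size=(d_latent, n_obs))
W_dec = rng.normal(scale=0.3, size=(n_obs, d_latent))
K = 0.95 * np.eye(d_latent)  # placeholder Koopman matrix

def encode(x):
    return np.tanh(x @ W_enc.T)

def decode(z):
    return z @ W_dec.T

def kae_losses(x_traj):
    """Return (reconstruction, linearity, prediction) MSE terms.

    x_traj has shape (steps, n_obs).
    """
    z_traj = encode(x_traj)

    # Latent rollout from the first latent only: z0, K z0, K^2 z0, ...
    z_roll = [z_traj[0]]
    for _ in range(len(x_traj) - 1):
        z_roll.append(K @ z_roll[-1])
    z_roll = np.stack(z_roll)

    recon = np.mean((decode(z_traj) - x_traj) ** 2)   # can it recover what it saw?
    linear = np.mean((z_roll - z_traj) ** 2)          # does the latent follow K^t z0?
    pred = np.mean((decode(z_roll) - x_traj) ** 2)    # does it forecast correctly?
    return recon, linear, pred

x_traj = rng.normal(size=(horizon + 1, n_obs))
recon, linear, pred = kae_losses(x_traj)
total = recon + linear + pred  # equal weights are an assumption
```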

But KAEs have a well-known weakness: they drift. Even if the first few predictions are spot-on, small inaccuracies in phase or amplitude compound over time, especially in systems with continuous spectra, switching behavior, or strong transients. The model’s latent state slowly leaves the “learned manifold” — the data-rich region it was trained on — and predictions spiral out of control.

To fix this, the researchers introduced two innovations. First, an attention-free latent memory (AFT) block, inspired by the AFT-full architecture of Zhai et al. (2021). Unlike standard multi-head attention (MHA), which computes pairwise interactions between all past and present latents — a process that scales quadratically with sequence length — AFT aggregates a short window of past latents (e.g., the last 10 steps) using a lightweight, linear-time mechanism. It computes a corrected latent $\tilde{z}_{t - 1}$ before each Koopman update:

$$\tilde{z}_{t-1} = \mathrm{AFT}(H_{t}), \qquad z_{t} = K \tilde{z}_{t-1}$$

where $H_{t} = [z_{t-T}, \dots, z_{t-1}]$ is the window of the last $T$ latents. The AFT block uses learned linear projections $W_{Q}, W_{K}, W_{V}$ and a sigmoid-gated attention mechanism (Eq. 5 in the paper) that emphasizes recent context without the computational burden of full attention. It adds only about 30,000 parameters ($3d^{2} + T^{2} = 30{,}100$ for $d = 100$, $T = 10$), making it compact and fast.
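For readers who want to see the mechanism, the sketch below implements the published AFT-full formula from Zhai et al. (2021) on a window of latents. It is not a reproduction of the paper's Eq. 5: taking the last row of the output as the corrected latent, the zero position biases, and the random weights are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aft_full(H, W_q, W_k, W_v, w):
    """AFT-full over a window H of shape (T, d); returns shape (T, d).

    Y_t = sigmoid(Q_t) * sum_s exp(K_s + w[t, s]) * V_s
                       / sum_s exp(K_s + w[t, s])
    """
    Q, Kmat, V = H @ W_q, H @ W_k, H @ W_v                # each (T, d)
    logits = Kmat[None, :, :] + w[:, :, None]             # (T, T, d): [t, s, feature]
    logits = logits - logits.max(axis=1, keepdims=True)   # stabilise the exponentials
    weights = np.exp(logits)
    pooled = (weights * V[None, :, :]).sum(axis=1) / weights.sum(axis=1)
    return sigmoid(Q) * pooled

d, T = 100, 10
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
w = np.zeros((T, T))           # pairwise position biases (learned in practice)
H = rng.normal(size=(T, d))    # [z_{t-T}, ..., z_{t-1}]

z_tilde = aft_full(H, W_q, W_k, W_v, w)[-1]   # assumed: last row is the corrected latent
n_params = 3 * d * d + T * T                  # three projections plus position biases
```

With $d = 100$ and $T = 10$, `n_params` comes to 30,100, which matches the figure quoted above. Because the window length is fixed, the cost of each correction does not grow with the length of the rollout.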

Second, they introduced dynamic re-encoding, a self-correcting mechanism that detects when the model is drifting and snaps it back onto the learned manifold. At each step, the model computes two predictions: one from the current latent $z_{t-1}$, and another from its projection $P(z_{t-1}) = \varphi(\varphi^{-1}(z_{t-1}))$, which forces it back through the autoencoder (decode with $\varphi^{-1}$, then encode again with $\varphi$). The difference between these predictions, $\delta_{t} = \lVert z_{t}^{\text{re-pred}} - z_{t}^{\text{pred}} \rVert_{2}^{2}$, serves as a drift detector. When $\delta_{t}$ exceeds a threshold (determined by lightweight streaming algorithms like EWMA, CUSUM, or sequential two-sample tests), the model replaces $z_{t-1}$ with $P(z_{t-1})$ before the next step.
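A rough sketch of this loop, using an EWMA-style trigger, might look like the following. Everything specific here is an assumption for illustration: the function name, the values of `alpha`, `n_sigma`, and `warmup`, and the toy encoder and decoder, which simply treat the first two latent coordinates as the "manifold". The AFT correction is left out to keep the example short.

```python
import numpy as np

def rollout_with_reencoding(z0, K, encode, decode, steps,
                            alpha=0.1, n_sigma=3.0, warmup=10):
    """Roll out z_t = K z_{t-1}, re-encoding when the drift score spikes.

    The trigger compares delta_t with an exponentially weighted running
    mean and variance of past scores (parameter values are illustrative).
    """
    z = z0
    mean, var = 0.0, 0.0
    latents, resets = [], []
    for t in range(steps):
        z_proj = encode(decode(z))          # P(z): back through the autoencoder
        z_pred = K @ z
        z_re_pred = K @ z_proj
        delta = float(np.sum((z_re_pred - z_pred) ** 2))

        threshold = mean + n_sigma * np.sqrt(var)
        if t >= warmup and delta > threshold:
            z_pred = z_re_pred              # continue from the projected latent
            resets.append(t)

        # Update the running statistics of the drift score.
        diff = delta - mean
        mean += alpha * diff
        var = (1.0 - alpha) * (var + alpha * diff ** 2)

        z = z_pred
        latents.append(z)
    return np.stack(latents), resets

# Toy setup (hypothetical): the "manifold" is the first two coordinates.
def decode(z):
    return z[:2]

def encode(x):
    return np.concatenate([x, np.zeros(2)])

c, s = np.cos(0.1), np.sin(0.1)
K = np.zeros((4, 4))
K[:2, :2] = [[c, -s], [s, c]]     # on-manifold rotation
K[2:, 2:] = 1.5 * np.eye(2)       # off-manifold error grows every step

z0 = np.array([1.0, 0.0, 1e-3, 1e-3])
latents, resets = rollout_with_reencoding(z0, K, encode, decode, steps=30)
```

In this toy, the small off-manifold error grows until the trigger fires once the warm-up period ends, after which the latent stays on the manifold and no further resets are needed.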

This is like a GPS that not only predicts your route but also checks whether you’ve strayed off the road and, if you have, quietly snaps you back onto the nearest one.

The full system — KAE + AFT + dynamic re-encoding — was tested on three canonical systems: the Duffing oscillator (a bistable mechanical system with switching dynamics), the Repressilator (a synthetic gene oscillator with a stable limit cycle), and IRMA (a five-gene regulatory network in yeast, designed as a benchmark for systems biology). These were chosen because they stress long-horizon prediction in different ways: mixed spectra, phase sensitivity, and high-dimensional feedback.

What They Found

The results were striking. Across all systems, the AFT block alone outperformed matched-capacity multi-head attention (4 and 10 heads) in both accuracy and error accumulation. But when combined with dynamic re-encoding, the model achieved near-flat error growth over 1,000 steps.

On the Duffing oscillator, a system prone to chaotic switching between energy wells, the full model reduced mean squared error (MSE) to 0.0113 at 200 steps and 0.0960 at 500 steps — roughly 8–9 times lower than MHA and 3–4 times lower than AFT alone. At 1,000 steps, the improvement persisted, with dynamic re-encoding maintaining an average MSE of $0.190 \pm 0.010$ across random seeds, compared to $0.275 \pm 0.075$ for AFT alone and catastrophic failure (MSE >10) in some KAE runs.

Duffing Oscillator: Prediction Error (MSE) at 200 Steps

Model	MSE
KAE + AFT + Re-enc	0.0113
KAE + AFT	0.0427
KAE + Periodic Re-enc	0.0156
KAE	0.1286
GRU (+Ctx)	0.0862
Transformer (+Ctx)	0.1101

On the Repressilator, a clean limit-cycle oscillator, AFT alone was best, achieving an MSE of $3 \times 10^{-4}$, because the system’s regularity doesn’t require frequent re-encoding. In fact, dynamic re-encoding slightly degraded performance ($\sim 4 \times 10^{-3}$), likely by introducing unnecessary phase resets. This shows the method’s nuance: sometimes, less correction is better.

But on IRMA, a complex, feedback-rich gene circuit, dynamic re-encoding was essential. The plain KAE diverged completely at 1,000 steps (single-run MSE: 10.18), and AFT alone degraded to 0.0012. The full model, however, held MSE to just 0.0003, an improvement of more than 30,000-fold over the KAE baseline. Even compared to GRUs and Transformers, which were given 50 steps of context to help them settle, the Koopman+AFT model performed better or comparably, despite starting from only the initial condition.

IRMA: Prediction Error (MSE) at 1000 Steps

Model	MSE
KAE + AFT + Re-enc	0.0003
KAE + Periodic Re-enc	0.0008
KAE + AFT	0.0012
GRU (+Ctx)	0.0004
KAE	10.18
GRU (Init)	0.0102

The researchers also measured mean cumulative absolute error (MCAE), which tracks how error accumulates over time. Here, the benefits were even clearer. On IRMA, AFT reduced MCAE by 4.6–5.0× compared to MHA; on Duffing, by 4.5–4.8×. The error curves (panel (d), IRMA MCAE) show AFT’s signature: a rapid initial drop, followed by a flat, stable trajectory, evidence that local corrections prevent runaway drift.

[Figure, panel (d): IRMA, MCAE. Source: Mohammed Nagdi, Evangelos-Marios Nikolados]
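The article does not spell out how MCAE is normalised, so the following is only one plausible reading: the running average of the per-step absolute error, which is a quantity that can drop early and then flatten. The function name and the choice to average over state dimensions are assumptions.

```python
import numpy as np

def mcae(y_true, y_pred):
    """Running mean of absolute error up to each step (assumed definition).

    y_true, y_pred: arrays of shape (steps, n_dims). Returns shape (steps,).
    The paper's exact normalisation may differ.
    """
    abs_err = np.abs(np.asarray(y_pred) - np.asarray(y_true)).mean(axis=1)
    steps = np.arange(1, len(abs_err) + 1)
    return np.cumsum(abs_err) / steps

y_true = np.zeros((4, 2))
y_pred = np.array([[0.1, 0.1], [0.2, 0.0], [0.0, 0.0], [0.4, 0.2]])
curve = mcae(y_true, y_pred)   # per-step errors are 0.1, 0.1, 0.0, 0.3
```

A curve that stays flat under this definition means new errors are arriving no faster than the historical average, which is the behaviour described for AFT.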

And it’s fast. Because AFT runs in linear time and adds minimal parameters, inference latency is lower than Transformer-based models, even though it achieves far better long-horizon accuracy. This makes it suitable for real-time applications, from adaptive control to robotic planning.

Why This Changes Things

This work isn’t just about better prediction — it’s about building trust in AI for real-world systems. In fields like synthetic biology, climate modeling, or autonomous vehicles, a model that drifts is worse than useless; it’s dangerous. If a gene circuit controller loses phase, it could misfire. If a climate emulator diverges, it could mislead policy. If a robot’s internal model drifts, it could crash.

The Koopman+AFT framework offers a path to certifiable prediction: models that stay within known bounds, correct their own errors, and remain interpretable. Unlike black-box LSTMs or Transformers, the Koopman matrix $K$ is a linear operator — its eigenvalues reveal system timescales, its modes correspond to physical patterns. Adding AFT and dynamic re-encoding doesn’t obscure this; it protects it.
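As a small illustration of how eigenvalues translate into timescales, the sketch below builds a hypothetical 3×3 matrix `K` with one decaying oscillation and one purely decaying mode, then reads off the oscillation period and decay times. The matrix, the sampling interval `dt`, and the variable names are assumptions chosen for the example, not values from the paper.

```python
import numpy as np

dt = 0.1                  # assumed sampling interval between steps
theta, rho = 0.2, 0.98    # rotation angle per step and per-step shrink factor

# Hypothetical Koopman matrix: a slowly decaying rotation plus a decaying mode.
K = np.array([
    [rho * np.cos(theta), -rho * np.sin(theta), 0.0],
    [rho * np.sin(theta),  rho * np.cos(theta), 0.0],
    [0.0,                  0.0,                 0.9],
])

eigvals = np.linalg.eigvals(K).astype(complex)

# Convert discrete-time eigenvalues to continuous-time rates: lambda = exp(mu * dt).
mu = np.log(eigvals) / dt
decay_times = -1.0 / mu.real                 # e-folding time of each mode
oscillatory = np.abs(mu.imag) > 1e-9
periods = 2 * np.pi / np.abs(mu.imag[oscillatory])
```

Here the oscillatory pair has a period of $2\pi \, dt / \theta \approx 3.14$ time units, and the moduli 0.98 and 0.9 set how quickly each mode fades.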

Consider IRMA, the yeast gene circuit. It was engineered in 2009 as a “benchmark in a bottle” — a living testbed for modeling and control. Today, synthetic biologists are designing gene therapies, biosensors, and programmable cells. But without reliable models, these systems are tuned by trial and error. This work shows that we can now build digital twins of living circuits that stay accurate for hundreds of cycles — enabling simulation-based design, fault detection, and closed-loop control.

The same principles apply to fluid dynamics, where Koopman methods have been used to model turbulence, or power grids, where switching behavior and transients are routine. A compact, low-latency predictor that doesn’t drift could enable real-time stability monitoring, anomaly detection, and adaptive control — all with minimal compute.

And unlike large language models that consume megawatts, this system’s memory block adds only about 30,000 parameters. That’s not a typo. For context, a single attention head in a small Transformer can have hundreds of thousands. This is edge-AI-ready: deployable on microcontrollers, lab-on-a-chip devices, or satellites.

The choice between AFT and dynamic re-encoding also reveals a deeper insight: not all systems need the same kind of memory. Clean oscillators benefit from local correction (AFT); complex, switching systems need global anchoring (re-encoding). This suggests a future where AI predictors are not one-size-fits-all, but tailored to the physics of the system they model.

What’s Next

The authors acknowledge limitations. The method assumes the autoencoder manifold is well-learned — if the initial representation is poor, re-encoding won’t help. And while streaming triggers are lightweight, they require tuning (e.g., window size, threshold). The paper includes ablations over trigger policies, showing that sequential two-sample tests often outperform EWMA or CUSUM, but the best choice depends on the system.
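To give a sense of what tuning such a trigger involves, here is a minimal one-sided CUSUM detector in its standard textbook form. The function name and the `target`, `slack`, and `limit` values are hypothetical, and the paper's own trigger settings are not reproduced here.

```python
def cusum_alarms(scores, target, slack, limit):
    """One-sided CUSUM over a stream of drift scores.

    S_t = max(0, S_{t-1} + x_t - target - slack); an alarm is raised when
    S_t exceeds `limit`, after which the statistic is reset to zero.
    """
    s = 0.0
    alarms = []
    for t, x in enumerate(scores):
        s = max(0.0, s + x - target - slack)
        if s > limit:
            alarms.append(t)
            s = 0.0
    return alarms

# Quiet scores, a sustained burst of drift, then quiet again.
scores = [0.1] * 5 + [0.6] * 4 + [0.1] * 3
alarms = cusum_alarms(scores, target=0.1, slack=0.1, limit=1.0)
```

The trade-off is visible in the parameters: a larger `slack` or `limit` ignores brief blips but reacts later to real drift, while smaller values re-encode more eagerly.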

Future work could explore adaptive triggers — machine-learned detectors that adjust their sensitivity based on system behavior — or hybrid training, where re-encoding is used during training to stabilize rollouts. There’s also potential to extend this to stochastic systems, where uncertainty quantification could guide when to re-encode.

But the bigger vision is clear: a new class of physics-informed AI that doesn’t just predict, but understands — and corrects — its own limits. In a world increasingly reliant on AI to model complex systems, that self-awareness isn’t just useful. It’s essential.

As the authors write: “The result is a fast, compact predictor that stays on the learned manifold over long horizons.” That sentence, quiet and technical, might just describe the future of trustworthy AI.

Figure 1: Workflow of the Koopman autoencoder with AFT and Dynamic Re-encoding. (a) Sampled trajectories from a Duffing Oscillator serve as input. (b) The core Koopman autoencoder learns a linear latent representation by minimizing reconstruction, linearization, and prediction losses. (c) The prediction process uses a Dynamic Re-encoding module with AFT attention to refine the latent state ($z_{t} \to \tilde{z}_{t}$), which is then evolved by the learned Koopman operator $K$. (d) The final output shows predicted trajectories matching the reference dynamics. Source: Mohammed Nagdi, Evangelos-Marios Nikolados

[Figure, panel (b): IRMA, multi-trajectory. Source: Mohammed Nagdi, Evangelos-Marios Nikolados]
