FASTER: Teaching Robot Brains to Think Smarter, Not Harder
A new RL method substantially cuts robot training and inference costs by learning to prune bad action candidates mid-thought, before the AI even finishes deciding.
FASTER matches top-performing robot AI at a fraction of the compute cost.
Opening
There's a quiet arms race happening inside robotics labs right now. The most capable robot-learning systems — the ones that can fold laundry, stack objects, and manipulate tools — have gotten good partly by thinking longer. At inference time, rather than committing to a single action, they generate dozens of candidate actions and pick the best one. It's the AI equivalent of a chess grandmaster playing out multiple lines in their head before moving a piece. Powerful, yes. But also expensive — sometimes ruinously so.
FASTER, proposed by Perry Dong, Alexander Swerdlow, Dorsa Sadigh, and Chelsea Finn at Stanford (Dong et al., 2026), offers a different path. Instead of waiting until all the candidates are fully formed before picking one, FASTER teaches the AI to start eliminating bad options while it's still thinking. The result is a system that achieves the performance of expensive multi-sample methods at a fraction of the computational cost — a genuine case of working smarter, not harder.
The Science
To understand why FASTER matters, you first need to understand how modern generative robot policies work — and why the best ones have become so compute-hungry.
The breakthrough in robot learning over the past few years has come largely from diffusion policies — a class of AI models borrowed from image generation (think Stable Diffusion or DALL·E, but generating robot movements instead of pictures). A diffusion model works by starting with random noise and progressively "denoising" it into a clean, coherent output. For robot policies, that output is an action sequence: a precise choreography of joint positions and forces that accomplishes a task.
These diffusion policies can be trained or fine-tuned with reinforcement learning (RL), where a robot learns by trial and error, receiving reward signals when it succeeds and updating its behavior accordingly. Some of the most performant RL algorithms augment this with test-time scaling: at inference time, the model generates N candidate action sequences in parallel and then uses a learned value function (essentially a critic that scores how good each action is expected to be) to select the best one. This "Best-of-N" sampling has delivered real performance gains, but running a full denoising process for N=16 or N=32 candidates every time the robot needs to act is expensive in both computation and time.
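To make the cost of Best-of-N concrete, here is a minimal toy sketch in plain Python. It is not the paper's code: `denoise_step` and `q_value` are hypothetical stand-ins for a learned denoiser and a learned critic, and the 10-step schedule is an assumed number chosen for illustration. The point is only the bookkeeping: every one of the N candidates pays for every denoising step before the critic gets to choose.

```python
import random

NUM_STEPS = 10  # assumed number of denoising steps per candidate


def denoise_step(action, rng):
    # Toy stand-in for a learned denoiser: shrink the sample, add a little noise.
    return [0.8 * a + 0.05 * rng.gauss(0.0, 1.0) for a in action]


def q_value(state, action):
    # Toy stand-in for a learned critic: actions closer to the state score higher.
    return -sum((a - s) ** 2 for a, s in zip(action, state))


def best_of_n(state, n, rng, action_dim=4):
    # Start every candidate from pure Gaussian noise.
    candidates = [[rng.gauss(0.0, 1.0) for _ in range(action_dim)] for _ in range(n)]
    denoiser_calls = 0
    for _ in range(NUM_STEPS):
        candidates = [denoise_step(a, rng) for a in candidates]
        denoiser_calls += n  # all n candidates are denoised at every step
    # Only now, with every candidate fully denoised, does the critic pick one.
    best = max(candidates, key=lambda a: q_value(state, a))
    return best, denoiser_calls


rng = random.Random(0)
state = [0.1, -0.2, 0.3, 0.0]
action, calls = best_of_n(state, n=16, rng=rng)
print(calls)  # 160: 16 candidates x 10 steps
```

With N=16 and 10 steps, the toy denoiser runs 160 times to produce a single action, and 15 of the 16 finished candidates are thrown away.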
FASTER's central insight is elegant: most of the information needed to judge whether an action is good or bad becomes available well before the denoising is complete. A partially denoised action candidate — still noisy, not yet executable — already carries enough signal to distinguish promising candidates from duds. The question is how to exploit that signal systematically.
The researchers' answer is to reframe the entire selection process as its own Markov Decision Process (MDP) — a mathematical framework for sequential decision-making under uncertainty. An MDP is the standard formalism for RL problems: an agent takes actions in states, receives rewards, and learns a policy (a mapping from states to actions) that maximizes cumulative reward. By treating each step of the denoising process as a state and the act of filtering candidates as an action, Dong et al. construct a meta-level RL problem inside the denoising loop. The agent in this inner MDP doesn't control the robot; it controls which candidate action sequences survive to the next denoising step.
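For readers who like to see the pieces written down, here is one way such an inner MDP could be formalized. This is an illustrative assumption, not the paper's notation: the symbols $S_k$ (the set of surviving candidates at denoising step $k$), $p_\theta$ (the denoiser), $Q$ (the critic), and the per-candidate compute penalty $\lambda$ are all hypothetical, and the authors' actual definitions may differ.

```latex
\begin{aligned}
\text{state:}\quad & x_k = \bigl(s,\; k,\; \{a_k^{(i)}\}_{i \in S_k}\bigr), \qquad S_K = \{1,\dots,N\} \\
\text{action:}\quad & S_{k-1} \subseteq S_k, \qquad S_{k-1} \neq \emptyset \\
\text{transition:}\quad & a_{k-1}^{(i)} \sim p_\theta\bigl(\cdot \mid a_k^{(i)}, s, k\bigr) \quad \text{for each } i \in S_{k-1} \\
\text{reward:}\quad & r_k = -\lambda\,\lvert S_{k-1} \rvert \;\; (k \ge 1), \qquad r_{\text{final}} = \max_{i \in S_0} Q\bigl(s, a_0^{(i)}\bigr)
\end{aligned}
```

Read this way, the filtering agent starts at the noisiest step $K$ with all $N$ candidates, decides at each step which ones get another round of denoising, pays a small cost for each survivor, and is rewarded at the end by the quality of the best fully denoised action that remains.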
Within this MDP, FASTER learns both a policy (which candidates to eliminate at each denoising step) and a value function (an estimate of how good a partially denoised candidate will ultimately turn out to be). This value function is the key workhorse: it gives FASTER the ability to prune candidates confidently at early, cheap denoising steps rather than waiting until the full, expensive process is complete.
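The pruning idea can be sketched in a few lines of plain Python. This is a toy illustration under loud assumptions, not FASTER itself: the fixed `KEEP_AFTER_STEP` schedule and the top-k rule are hypothetical placeholders for the filtering policy that FASTER learns, and `denoise_step` and `partial_value` are toy stand-ins for a learned denoiser and a learned value function over partially denoised candidates.

```python
import random

NUM_STEPS = 10  # assumed number of denoising steps

# Hypothetical fixed schedule: after step k, keep only this many candidates.
# FASTER learns its filtering decisions; this table is a stand-in.
KEEP_AFTER_STEP = {2: 8, 5: 4, 8: 1}


def denoise_step(action, rng):
    # Toy stand-in for a learned denoiser: shrink the sample, add a little noise.
    return [0.8 * a + 0.05 * rng.gauss(0.0, 1.0) for a in action]


def partial_value(state, noisy_action, step):
    # Toy stand-in for a learned value estimate of a partially denoised candidate.
    # A real estimator would likely condition on `step`; this toy ignores it.
    return -sum((a - s) ** 2 for a, s in zip(noisy_action, state))


def filtered_sampling(state, n, rng, action_dim=4):
    candidates = [[rng.gauss(0.0, 1.0) for _ in range(action_dim)] for _ in range(n)]
    denoiser_calls = 0
    for step in range(1, NUM_STEPS + 1):
        candidates = [denoise_step(a, rng) for a in candidates]
        denoiser_calls += len(candidates)  # only survivors cost compute
        keep = KEEP_AFTER_STEP.get(step)
        if keep is not None and keep < len(candidates):
            # Rank the still-noisy candidates and drop the weakest ones.
            candidates.sort(key=lambda a: partial_value(state, a, step), reverse=True)
            candidates = candidates[:keep]
    best = max(candidates, key=lambda a: partial_value(state, a, NUM_STEPS))
    return best, denoiser_calls


rng = random.Random(0)
state = [0.1, -0.2, 0.3, 0.0]
action, calls = filtered_sampling(state, n=16, rng=rng)
print(calls)  # 70: 16+16 + 8+8+8 + 4+4+4 + 1+1
```

In this toy setup the denoiser runs 70 times instead of the 160 that denoising all 16 candidates to completion would need. Whether the surviving action is as good as the Best-of-N pick depends entirely on how well the value estimate ranks noisy candidates, which is the part FASTER has to learn.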
Crucially, FASTER is designed to be modular. It plugs into existing generative RL frameworks — the researchers demonstrate compatibility with DPPO (Diffusion Policy Policy Optimization) for online RL and IDQL for batch-online RL — without requiring those frameworks to be redesigned from the ground up. The method is trained jointly with the underlying policy, learning to filter in the same denoising space where actions are generated.
What They Found
The researchers evaluated FASTER on a suite of challenging long-horizon robotic manipulation tasks: the kind where a robot must execute a precise sequence of sub-tasks to succeed, and where a single mistake early in the sequence can doom the entire episode. These tasks are a demanding testbed because they require sustained competence across many steps, not just a single dexterous move.
Across both online RL settings (where the robot learns by interacting with the environment in real time) and batch-online RL settings (where learning begins from a fixed dataset before additional online refinement), FASTER consistently achieved the best overall performance among all compared methods (Dong et al., 2026). This held across multiple task environments, suggesting the improvements are robust rather than cherry-picked.
FASTER vs. Baselines: Online RL Task Performance
FASTER consistently achieves the best overall performance across long-horizon manipulation tasks compared to competing methods in the online RL setting.
| Method | Relative performance |
|---|---|
| Base Policy | 1 |
| Best-of-N | 2 |
| Value-Guided (Alt.) | 2 |
| FASTER | 3 |
The comparison set included not just the base diffusion policy baselines but also alternative approaches to value-guided action selection — methods that also try to leverage value functions at inference time but in less structured ways. FASTER outperformed these alternatives, which the researchers attribute to its explicit MDP formulation: by treating filtering as a proper sequential decision problem, FASTER learns a more principled and consistent pruning strategy.
FASTER vs. Baselines: Batch-Online RL Task Performance
FASTER also leads in the batch-online RL setting, where learning begins from a fixed dataset before online refinement.
| Method | Relative performance |
|---|---|
| Base Policy | 1 |
| Best-of-N | 2 |
| Value-Guided (Alt.) | 2 |
| FASTER | 3 |
Perhaps the most practically significant result comes from the VLA (Vision-Language-Action model) experiments. VLAs — large pretrained models that take in camera images and language instructions and output robot actions — represent the frontier of generalist robot AI. They're also extremely expensive to train and run. When FASTER was applied to a pretrained VLA, it achieved the same task performance as the full Best-of-N baseline while substantially reducing both training and inference compute requirements (Dong et al., 2026). This is a meaningful result: it suggests that FASTER's efficiency gains aren't just a laboratory curiosity but translate to the large-scale systems that are closest to real-world deployment.
FASTER on VLA: Performance vs. Compute Trade-off
Applied to a pretrained Vision-Language-Action model, FASTER matches Best-of-N performance while requiring substantially less training and inference compute.
| Method | Relative performance |
|---|---|
| Best-of-N (Full Compute) | 3 |
| FASTER (Reduced Compute) | 3 |
The value function FASTER learns generalizes across denoising steps in a useful way. At very early denoising steps, the partially denoised candidates are still quite noisy — resembling action-shaped blobs more than precise trajectories. Even so, the value function can already separate the roughly-promising from the roughly-hopeless, allowing early, aggressive pruning. By mid-denoising, the value function's predictions sharpen, and FASTER begins eliminating candidates with greater confidence. By the final steps, only a small number of high-quality candidates remain, and the overhead of completing their denoising is minimal. The result is that FASTER does most of its heavy thinking early and cheaply, rather than running every candidate all the way to completion before choosing.
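A little arithmetic shows why pruning early matters so much. The sketch below counts denoiser evaluations for a given pruning schedule; the schedules and the 10-step, N=32 setup are made-up numbers for illustration, not figures from the paper, and the count ignores the (typically smaller) cost of running the value function itself.

```python
def denoiser_calls(n, num_steps, keep_after):
    """Count denoiser evaluations when `keep_after[step]` candidates survive each step."""
    alive = n
    total = 0
    for step in range(1, num_steps + 1):
        total += alive  # every surviving candidate is denoised once
        alive = min(alive, keep_after.get(step, alive))  # prune after this step
    return total


full = denoiser_calls(32, 10, {})               # no pruning: plain Best-of-N
early = denoiser_calls(32, 10, {1: 8, 3: 2})    # hypothetical early pruning
late = denoiser_calls(32, 10, {7: 8, 9: 2})     # same cuts, made late

print(full, early, late)  # 320 62 242
```

Both pruned schedules end with the same two finalists, but making the cuts early costs 62 denoiser calls while making them late costs 242, against 320 for no pruning at all. The savings come almost entirely from how soon the value function can be trusted.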
Why This Changes Things
The practical implications go beyond benchmark numbers. Compute costs are one of the most significant barriers to deploying capable robot AI at scale. Training a high-performance diffusion policy with Best-of-N sampling already requires substantial GPU resources; inference with N=16 or N=32 candidates in real time on a physical robot is often simply infeasible with commodity hardware. FASTER's ability to approximate the performance of Best-of-N sampling while using a fraction of the compute opens up a genuine path toward deploying capable, RL-trained robots on hardware that doesn't require a data center.
There's also a deeper conceptual contribution here. The framing of action selection as an MDP is more than a clever trick — it's a principled way of thinking about a problem that the field has mostly approached heuristically. Previous methods for value-guided inference in diffusion policies have tended to apply value functions at a single point (usually near the end of denoising) or as a weighting signal rather than a structured sequential filter. By building the full MDP machinery — with a policy, a value function, and proper RL training — FASTER treats the filtering problem with the same rigor as the underlying robot control problem. That methodological clarity is likely to be generative: it gives other researchers a well-defined framework to build on.
The plug-in nature of FASTER is also worth emphasizing. The robot learning ecosystem has fragmented in recent years, with dozens of competing architectures and training paradigms. A method that improves performance across multiple underlying frameworks, without requiring those frameworks to be redesigned, has a much wider reach than one that is tightly coupled to a single approach. DPPO and IDQL are quite different algorithms; the fact that FASTER improves both suggests the method is touching something general rather than exploiting a quirk of one particular system (Dong et al., 2026).
Finally, the VLA results matter for reasons that extend beyond compute savings. VLAs are increasingly seen as the path toward robots that can follow arbitrary natural-language instructions — the kind of general-purpose robotic assistants that have been promised for decades. But VLAs are expensive to fine-tune and expensive to run. A method that makes them more efficient at inference time, with no loss of performance, directly accelerates the timeline to practical deployment. Even modest compute reductions compound significantly at the scale of large robotic fleets.
What's Next
The honest caveats matter too. FASTER's results, while compelling, come from simulated environments and a limited set of manipulation tasks. The long-horizon tasks used in the benchmarks are genuinely challenging, but they are still structured environments with defined reward signals — a far cry from the messy, reward-sparse real world where most robots will eventually need to operate. Whether FASTER's value function learns representations that transfer to real physical hardware, with all its sensor noise and contact dynamics, remains to be demonstrated.
There's also an open question about how FASTER scales as N — the initial number of action candidates — grows large. The paper demonstrates gains in the regimes most commonly used in the literature, but the relationship between N, compute savings, and performance is not yet fully characterized. It's possible that at very large N, different pruning strategies or value function architectures would be needed.
The MDP framing also opens up new questions that the paper doesn't fully answer. The inner MDP that FASTER solves — filter or don't filter at each denoising step — is itself a sequential decision problem that could in principle be solved more or less optimally. FASTER uses a relatively straightforward RL approach within this inner MDP, but richer formulations (perhaps with learned stopping criteria, or with more sophisticated exploration during training) might yield further gains. The framework is, in a sense, a new design space, and this paper has only explored one corner of it.
The intersection of diffusion models and RL is one of the most active areas in AI right now, and FASTER arrives at a moment when the field is hungry for efficiency solutions. As VLAs and large diffusion policies become the default architecture for capable robots, the methods used to scale them at inference time will become increasingly consequential. A method that makes test-time scaling cheaper and more principled — and that works as a drop-in improvement for existing systems — is exactly the kind of infrastructure research that tends to have outsized long-term impact.
The code is publicly available at github.com/alexanderswerdlow/faster, which lowers the barrier to replication and extension considerably. In a field where methods can be difficult to reproduce, that openness matters.
The vision of robots that can reliably help with complex, multi-step physical tasks (assembling furniture, assisting in surgery, working in warehouses) depends not just on AI that is capable in principle but on AI that is efficient enough to be deployed at scale. FASTER doesn't solve that problem. But it meaningfully moves the needle, demonstrating that the performance ceiling of best-in-class RL methods need not be locked behind best-in-class compute budgets. Sometimes the most important advance is learning to think faster, not just harder.
"Our key insight is that we can model the denoising of multiple action candidates and selecting the best one as a Markov Decision Process where the goal is to progressively filter action candidates before denoising is complete." (Dong et al., 2026)