Meridia Insight Tech for Good Frontiers

Teaching Robots to Think Shorter: How AI-Learned Shortcuts Could Unlock Real-Time Control of Complex Systems

A satellite controller that normally takes thousands of seconds to plan ahead can now make decisions in real time — by learning from an expert what the future i

360 expert demos replaced a horizon-30 MINMPC with a 1-step controller — matching performance in real time.

One of the most stubborn problems in engineering control is deceptively simple to state: how do you make a machine plan wisely in real time when its actions are discrete? Not "turn the thruster to 37%," but "fire or don't fire." Not a dial, but a switch. That binary character — shared by satellite thrusters, fishing quotas, insulin pumps, power converters, and dozens of other real-world actuators — turns what would otherwise be a manageable optimization problem into something that belongs to the class of problems mathematicians call $\mathcal{NP}$-hard. The computational cost doesn't just grow with the planning horizon; it explodes exponentially.

The standard engineering solution — model predictive control, or MPC — works by solving a finite-horizon optimization at every time step, implementing only the first action, and then re-solving from the new state. It is powerful and widely deployed. But when your actions are integers rather than continuous numbers, each time step demands solving a mixed-integer nonlinear program (MINLP): a beast that can take hours to crack for modest-sized problems. For a satellite that needs a fresh decision every half-second, this is simply not viable.

A new paper by Christopher Orrico, W. P. M. H. Heemels, and Dinesh Krishnamoorthy from Eindhoven University of Technology and the Norwegian University of Science and Technology proposes an elegant escape route (Orrico et al., 2026). Instead of solving the full long-horizon problem online, they teach the controller what the future is worth — offline, from examples — and then let the online planner look just one step ahead, using that learned knowledge as a compass. The result is a "myopic" controller that matches the performance of a 30-step planner while solving a problem that is orders of magnitude smaller.

The Science

The technical framework sits at the intersection of three ideas: model predictive control, Bellman's principle of optimality, and inverse optimization.

MPC, in its standard form, solves an optimal control problem at each time step: minimize accumulated cost over a prediction horizon of steps, subject to dynamics and constraints. For systems with discrete actuators — captured by integer variables alongside continuous controls — this becomes a mixed-integer nonlinear MPC (MINMPC) problem. Longer horizons give better performance but make the problem exponentially harder to solve.

Bellman's principle of optimality offers a theoretical key. It says you can split any long-horizon problem into a short piece you solve explicitly and a "tail" piece you represent as a value function — the optimal cost-to-go from whatever state you reach at the end of the short piece. If you know that value function exactly, a one-step planner is provably equivalent to the full $N$-step planner. The catch: computing the true value function for all possible states is itself intractable, sometimes described as the "curse of dimensionality."

The authors' innovation is to learn an approximate value function — parameterized by a matrix , giving a quadratic form — from offline expert demonstrations. Those demonstrations are state-action pairs drawn from the expensive, long-horizon MINMPC solver. Rather than asking "what is the value at this state?" (which the expert doesn't directly tell you), they ask: "what value function would make the expert's actions look optimal?"

This is the province of inverse optimization. The specific tool they use is KKT residual minimization — KKT standing for Karush-Kuhn-Tucker, the mathematical conditions that characterize an optimal solution to a smooth optimization problem. In plain terms: find the parameters such that the expert's observed actions come as close as possible to satisfying the optimality conditions of the one-step planning problem.

Here is where the central technical puzzle arises. KKT conditions require calculus — differentiability, gradients, Lagrange multipliers. Integer constraints break all of that. A problem with has no gradients at integer boundaries; standard KKT theory simply doesn't apply. The authors resolve this with a deliberate "dual treatment": during offline learning, integer constraints are relaxed so that is treated as a continuous variable. This opens the door to KKT-based learning. Then, when the learned value function is deployed online, the integer constraints are fully reinstated. The controller remains genuinely mixed-integer and feasible; the learning phase just used a smooth surrogate to extract the structure of what the future costs.

Critically, the authors also show that even the offline expert demonstrations themselves can be generated from continuous relaxations of the MINMPC problem — meaning you never have to solve expensive MINLPs just to build your training set. This is a significant practical relief.

The resulting inverse optimization problem — finding the best — takes the form of a semidefinite program (SDP), a convex optimization class that solvers handle efficiently. By restricting to the quadratic family, the authors deliberately limit the complexity of the learned function, which improves sample efficiency: you don't need millions of demonstrations to recover something useful.

What They Found

The framework was validated on two benchmark problems. The first is the Lotka-Volterra fishing problem — a classic in mixed-integer optimal control, describing a predator-prey ecosystem where a manager must decide, at each time step, whether to fish (z=1$) or not ($z=0$), with the goal of stabilizing both populations near a target level. The dynamics are genuinely nonlinear: prey population $x_1 grows logistically, predator depends on prey, and fishing affects both with different rate coefficients ($c_1 = 0.4$, c_2 = 0.2$). The full MINMPC uses a prediction horizon of $N=30 steps.

To train the value function, the authors ran the full expert MINMPC from three different initial states, collecting state-action pairs. Generating this training data took a total of seconds — more than two hours. The SDP then produced the matrix:

The maximum KKT stationarity residual was and the complementary slackness residual was — both tiny relative to the scale of , indicating the expert demonstrations are nearly optimal under the learned value function.

KKT Residuals vs. Matrix Scale (Lotka-Volterra)

Maximum KKT residuals from the learned value function compared to the scale of the learned matrix P entries, showing near-zero residuals relative to learned parameters.

KKT Residuals vs. Matrix Scale (Lotka-Volterra)
LabelValue
P[1,1] (0.5965)0.5965
P[1,2] (0.4627)0.4627
P[2,2] (0.3589)0.3589
Max Stationarity Residual0.000097
Max Complementarity Residual0.0000967

Deployed online with a horizon of — just one step ahead — and with 10% plant-model mismatch (the true fishing coefficients differed from the model's), the myopic MINMPC tracked the reference population of nearly as well as the full 30-step controller. Crucially, a one-step MINMPC without the learned value function failed to drive the populations to the reference — confirming that the learned cost-to-go, not the planning depth, is doing the work.

Figure 1: (a) Prey and (b) predator state populations for the expert demonstrations (in light blue), MINMPC controller (in blue), and myopic MINMPC controller (in red). The (c) control decisions and (d) the computation time per controller decision.
Figure 1: (a) Prey and (b) predator state populations for the expert demonstrations (in light blue), MINMPC controller (in blue), and myopic MINMPC controller (in red). The (c) control decisions and (d) the computation time per controller decision. Source: Christopher Anthony Orrico, W. P. M. H. Heemels

The second test involved a satellite in orbit, modeled in polar coordinates with three states — orbital radius , radial velocity , and angular velocity — and two discrete thrusters, each capable of firing in three modes: . The satellite must maintain a safe orbital altitude ($3 \leq x_1 \leq 7$) and reach a target orbit. With and two ternary actuators, the full MINMPC is substantially harder than the fishing problem. The myopic controller, again with , maintained safe orbital bounds and matched the trajectory quality of the full expert in closed-loop simulation.

Figure 2: (a) Polar plot of the satellite trajectories for the offline expert demonstrations (light blue), MINMPC (blue), and myopic MINMPC (red). Each initial state is denoted by a circle. (b) z1z_{1} and (c) z2z_{2} for each controller. (d) computation time per controller decision.
Figure 2: (a) Polar plot of the satellite trajectories for the offline expert demonstrations (light blue), MINMPC (blue), and myopic MINMPC (red). Each initial state is denoted by a circle. (b) z1z_{1} and (c) z2z_{2} for each controller. (d) computation time per controller decision. Source: Christopher Anthony Orrico, W. P. M. H. Heemels

Training Data Size vs. Horizon: Myopic vs. Expert MINMPC

Comparison of key controller parameters: the expert MINMPC uses a 30-step horizon, while the myopic controller uses only 1 step ahead, trained on 360 demonstrations.

Training Data Size vs. Horizon: Myopic vs. Expert MINMPC
LabelValue
Expert MINMPC Horizon (steps)30
Offline Demonstrations (M)360
Offline Data Generation Time (s, ÷100)84.2

Why This Changes Things

To appreciate the significance, it helps to understand what real-time control actually demands. A satellite attitude controller that fires thrusters every 0.5 seconds cannot wait hours — or even seconds — for an optimizer to finish. The same applies to power electronics switching at kilohertz frequencies, or insulin delivery systems responding to glucose spikes. These applications have been largely out of reach for principled mixed-integer MPC precisely because solving MINLPs in real time is intractable.

Previous workarounds have involved approximations that often sacrifice accuracy (rounding relaxed solutions), architectures that sacrifice interpretability (deep neural networks approximating the full policy), or methods that require extensive interaction with the real system (reinforcement learning). Each carries a cost. Rounding can produce infeasible or suboptimal actions. Neural network policy approximation requires enormous offline datasets covering the full state space — and any update to the controller demands retraining. Reinforcement learning requires trial-and-error on the real system, which is unacceptable when mistakes are expensive or dangerous.

The myopic MINMPC framework avoids all three pitfalls. The learned object is not a full policy but a value function — a single matrix in the quadratic case. It requires only a modest number of expert demonstrations ($M = 360$ sufficed for the fishing problem). And because the online controller is still an explicit optimization problem with constraints, it inherits the interpretability and feasibility guarantees of MPC. Constraints are satisfied by construction; the controller can be inspected and reasoned about.

The conceptual move of relaxing integers during learning but reinstating them online is particularly clean. It avoids the theoretical dead-end of applying KKT conditions to discrete problems, while preserving the practical integrity of the controller. As the authors note, this is consistent with how nonlinear MPC is routinely solved in practice anyway — numerical solvers find KKT-satisfying stationary points, which are necessary but not sufficient for global optimality. The framework just extends this pragmatic posture to the learning phase.

Myopic MINMPC: Multi-Dimensional Performance Profile

Qualitative assessment of the myopic MINMPC framework across key engineering dimensions relative to baseline approaches, based on findings reported in the paper.

Myopic MINMPC: Multi-Dimensional Performance Profile
LabelValue
Real-Time Feasibility9
Closed-Loop Performance8
Data Efficiency8
Interpretability9
Constraint Satisfaction9
Offline Cost6

There is also a subtle result embedded in the paper's analysis: the learned policy is approximately "policy-consistent" with the expert. This is a formal notion — as the number of demonstrations , the probability that the learned policy deviates from the expert's policy by more than goes to zero. In finite-data practice, the near-zero KKT residuals observed in both benchmarks serve as practical evidence of this consistency.

What's Next

The paper is honest about its limitations, and they point toward a rich research agenda.

The quadratic form for is deliberately simple. For the benchmark problems tested here — with states and inputs of order — it works well. But real systems are messier. Highly nonlinear dynamics, non-quadratic cost structures, or state spaces spanning multiple orders of magnitude may demand richer function approximators: neural networks, Gaussian processes, or polynomial expansions. The authors acknowledge this, noting that more general choices of may improve performance at the cost of larger datasets and potentially non-convex learning problems.

Stability is a related open question. The paper demonstrates empirically that the myopic controller tracks references under plant-model mismatch, but formal closed-loop stability guarantees are not established. Related work (Baltussen et al., 2025) suggests that enforcing descent conditions during learning can help; integrating such conditions with the KKT-residual framework is a natural next step.

The offline data generation cost — over two hours for 360 demonstrations in the fishing problem — is still substantial, even if it is a one-time expense. For systems where offline MINLP solving is feasible but online solving is not, this is an entirely reasonable trade. But for problems with even longer horizons or higher-dimensional state spaces, the offline cost could become prohibitive. The authors' insight that continuous relaxations can replace exact MINLP solutions for data generation partially addresses this, but further work is needed to quantify how much approximation quality degrades when relaxed demonstrations replace exact ones.

Finally, the current framework is demonstrated on problems with a small number of integer variables. The satellite problem has two ternary actuators; the fishing problem has one binary. Real industrial systems — power grids, logistics networks, chemical plants with many switching units — could have dozens or hundreds of integer decisions. Whether the KKT residual minimization approach scales gracefully to high-dimensional integer spaces remains an open and important question.

What the paper establishes, nonetheless, is a principled and practically viable pathway into a problem space that has long been considered largely intractable. By teaching a controller what the future is worth — from a few hundred carefully chosen examples — and then letting it think just one step ahead, Orrico, Heemels, and Krishnamoorthy have shown that the gap between theoretical optimality and real-time feasibility can be bridged with surprisingly modest machinery. The implication reaches beyond satellites and fish populations: anywhere discrete actuators and tight timing requirements coexist, myopic MINMPC offers a route from the textbook to the real world.

The approximate value function merely needs to be informative enough to compensate for the shortened horizon of the resulting myopic MPC formulation.

Comments (0)

No comments yet. Be the first to share your thoughts.