Myopic MINMPC: Real-Time Control via Learned

Q: Expert Demonstrations Used

M = 360 — Three closed-loop trajectories from the Lotka-Volterra fishing problem, each from a different initial state

Q: Offline Data Generation Time

8,420 seconds — Total CPU time to generate the 360-point training dataset by running the full 30-step MINMPC offline

Q: Online Planning Horizon

η = 1 step — The myopic controller looks just one step ahead, versus the 30-step expert horizon — enabled by the learned value function

Q: Max KKT Stationarity Residual

9.70 × 10⁻⁵ — Near-zero residual confirms the expert demonstrations are approximately optimal under the learned quadratic cost-to-go

Q: Discrete Actuator Configurations

Binary & ternary — Fishing problem uses z ∈ {0,1}; satellite uses z₁, z₂ ∈ {-1, 0, 1} — both NP-hard in full MPC form

Q: Full MINMPC Horizon

N = 30 steps — The expert controller horizon for both benchmark problems, replaced online by a 1-step lookahead with learned terminal cost

One of the most stubborn problems in engineering control is deceptively simple to state: how do you make a machine plan wisely in real time when its actions are discrete? Not "turn the thruster to 37%," but "fire or don't fire." Not a dial, but a switch. That binary character — shared by satellite thrusters, fishing quotas, insulin pumps, power converters, and dozens of other real-world actuators — turns what would otherwise be a manageable optimization problem into something that belongs to the class of problems mathematicians call $\mathcal{NP}$-hard. The computational cost doesn't just grow with the planning horizon; it explodes exponentially.

The standard engineering solution — model predictive control, or MPC — works by solving a finite-horizon optimization at every time step, implementing only the first action, and then re-solving from the new state. It is powerful and widely deployed. But when your actions are integers rather than continuous numbers, each time step demands solving a mixed-integer nonlinear program (MINLP): a beast that can take hours to crack for modest-sized problems. For a satellite that needs a fresh decision every half-second, this is simply not viable.

A new paper by Christopher Orrico, W. P. M. H. Heemels, and Dinesh Krishnamoorthy from Eindhoven University of Technology and the Norwegian University of Science and Technology proposes an elegant escape route (Orrico et al., 2026). Instead of solving the full long-horizon problem online, they teach the controller what the future is worth — offline, from examples — and then let the online planner look just one step ahead, using that learned knowledge as a compass. The result is a "myopic" controller that matches the performance of a 30-step planner while solving a problem that is orders of magnitude smaller.

The Science

The technical framework sits at the intersection of three ideas: model predictive control, Bellman's principle of optimality, and inverse optimization.

MPC, in its standard form, solves an optimal control problem at each time step: minimize accumulated cost over a prediction horizon of $N$ steps, subject to dynamics and constraints. For systems with discrete actuators — captured by integer variables $z_{k} \in Z^{n_{z}}$ alongside continuous controls $u_{k}$ — this becomes a mixed-integer nonlinear MPC (MINMPC) problem. Longer horizons give better performance but make the problem exponentially harder to solve.

Bellman's principle of optimality offers a theoretical key. It says you can split any long-horizon problem into a short piece you solve explicitly and a "tail" piece you represent as a value function — the optimal cost-to-go from whatever state you reach at the end of the short piece. If you know that value function exactly, a one-step planner is provably equivalent to the full $N$-step planner. The catch: computing the true value function for all possible states is itself intractable, sometimes described as the "curse of dimensionality."

The authors' innovation is to learn an approximate value function $\hat{V} (x, θ)$ — parameterized by a matrix $P$ , giving a quadratic form $\hat{V} (x, θ) := x^{T} P x$ — from offline expert demonstrations. Those demonstrations are state-action pairs $(x_{i}, w_{i}^{*})$ drawn from the expensive, long-horizon MINMPC solver. Rather than asking "what is the value at this state?" (which the expert doesn't directly tell you), they ask: "what value function would make the expert's actions look optimal?"

This is the province of inverse optimization. The specific tool they use is KKT residual minimization — KKT standing for Karush-Kuhn-Tucker, the mathematical conditions that characterize an optimal solution to a smooth optimization problem. In plain terms: find the parameters $P$ such that the expert's observed actions come as close as possible to satisfying the optimality conditions of the one-step planning problem.

Here is where the central technical puzzle arises. KKT conditions require calculus — differentiability, gradients, Lagrange multipliers. Integer constraints break all of that. A problem with $z \in {0, 1}$ has no gradients at integer boundaries; standard KKT theory simply doesn't apply. The authors resolve this with a deliberate "dual treatment": during offline learning, integer constraints are relaxed so that $z$ is treated as a continuous variable. This opens the door to KKT-based learning. Then, when the learned value function is deployed online, the integer constraints are fully reinstated. The controller remains genuinely mixed-integer and feasible; the learning phase just used a smooth surrogate to extract the structure of what the future costs.

Critically, the authors also show that even the offline expert demonstrations themselves can be generated from continuous relaxations of the MINMPC problem — meaning you never have to solve expensive MINLPs just to build your training set. This is a significant practical relief.

The resulting inverse optimization problem — finding the best $P$ — takes the form of a semidefinite program (SDP), a convex optimization class that solvers handle efficiently. By restricting $\hat{V}$ to the quadratic family, the authors deliberately limit the complexity of the learned function, which improves sample efficiency: you don't need millions of demonstrations to recover something useful.

What They Found

The framework was validated on two benchmark problems. The first is the Lotka-Volterra fishing problem — a classic in mixed-integer optimal control, describing a predator-prey ecosystem where a manager must decide, at each time step, whether to fish ( $z=1$) or not ($z=0$), with the goal of stabilizing both populations near a target level. The dynamics are genuinely nonlinear: prey population $x_1$ grows logistically, predator $x_{2}$ depends on prey, and fishing affects both with different rate coefficients ($c_1 = 0.4$, $c_2 = 0.2$). The full MINMPC uses a prediction horizon of $N=30$ steps.

To train the value function, the authors ran the full expert MINMPC from three different initial states, collecting $M = 360$ state-action pairs. Generating this training data took a total of $t_{CPU} = 8.42 \times 1 0^{3}$ seconds — more than two hours. The SDP then produced the matrix:

$P = [0.5965 0.4627 0.4627 0.3589]$

The maximum KKT stationarity residual was $∥ r_{s t a t} ∥_{\infty} = 9.70 \times 1 0^{- 5}$ and the complementary slackness residual was $∥ r_{co m p} ∥_{\infty} = 9.67 \times 1 0^{- 5}$ — both tiny relative to the scale of $P$ , indicating the expert demonstrations are nearly optimal under the learned value function.

KKT Residuals vs. Matrix Scale (Lotka-Volterra)

Maximum KKT residuals from the learned value function compared to the scale of the learned matrix P entries, showing near-zero residuals relative to learned parameters.

KKT Residuals vs. Matrix Scale (Lotka-Volterra)
Label	Value
P[1,1] (0.5965)	0.5965
P[1,2] (0.4627)	0.4627
P[2,2] (0.3589)	0.3589
Max Stationarity Residual	0.000097
Max Complementarity Residual	0.0000967

Deployed online with a horizon of $η = 1$ — just one step ahead — and with 10% plant-model mismatch (the true fishing coefficients differed from the model's), the myopic MINMPC tracked the reference population of $[x_{1}^{ref}, x_{2}^{ref}]^{T} = 1$ nearly as well as the full 30-step controller. Crucially, a one-step MINMPC without the learned value function failed to drive the populations to the reference — confirming that the learned cost-to-go, not the planning depth, is doing the work.

Figure 1: (a) Prey and (b) predator state populations for the expert demonstrations (in light blue), MINMPC controller (in blue), and myopic MINMPC controller (in red). The (c) control decisions and (d) the computation time per controller decision. Source: Christopher Anthony Orrico, W. P. M. H. Heemels

The second test involved a satellite in orbit, modeled in polar coordinates with three states — orbital radius $r = x_{1}$ , radial velocity $\overset{r}{˙} = x_{2}$ , and angular velocity $\dot{θ} = x_{3}$ — and two discrete thrusters, each capable of firing in three modes: $z_{1}, z_{2} \in {- 1, 0, 1}$ . The satellite must maintain a safe orbital altitude ($3 \leq x_1 \leq 7$) and reach a target orbit. With $N = 30$ and two ternary actuators, the full MINMPC is substantially harder than the fishing problem. The myopic controller, again with $η = 1$ , maintained safe orbital bounds and matched the trajectory quality of the full expert in closed-loop simulation.

Figure 2: (a) Polar plot of the satellite trajectories for the offline expert demonstrations (light blue), MINMPC (blue), and myopic MINMPC (red). Each initial state is denoted by a circle. (b) z1z_{1} and (c) z2z_{2} for each controller. (d) computation time per controller decision. Source: Christopher Anthony Orrico, W. P. M. H. Heemels

Training Data Size vs. Horizon: Myopic vs. Expert MINMPC

Comparison of key controller parameters: the expert MINMPC uses a 30-step horizon, while the myopic controller uses only 1 step ahead, trained on 360 demonstrations.

Training Data Size vs. Horizon: Myopic vs. Expert MINMPC
Label	Value
Expert MINMPC Horizon (steps)	30
Offline Demonstrations (M)	360
Offline Data Generation Time (s, ÷100)	84.2

Key Facts

M = 360 Expert Demonstrations Used Three closed-loop trajectories from the Lotka-Volterra fishing problem, each from a different initial state

8,420 seconds Offline Data Generation Time Total CPU time to generate the 360-point training dataset by running the full 30-step MINMPC offline

η = 1 step Online Planning Horizon The myopic controller looks just one step ahead, versus the 30-step expert horizon — enabled by the learned value function

9.70 × 10⁻⁵ Max KKT Stationarity Residual Near-zero residual confirms the expert demonstrations are approximately optimal under the learned quadratic cost-to-go

Binary & ternary Discrete Actuator Configurations Fishing problem uses z ∈ {0,1}; satellite uses z₁, z₂ ∈ {-1, 0, 1} — both NP-hard in full MPC form

N = 30 steps Full MINMPC Horizon The expert controller horizon for both benchmark problems, replaced online by a 1-step lookahead with learned terminal cost

Why This Changes Things

To appreciate the significance, it helps to understand what real-time control actually demands. A satellite attitude controller that fires thrusters every 0.5 seconds cannot wait hours — or even seconds — for an optimizer to finish. The same applies to power electronics switching at kilohertz frequencies, or insulin delivery systems responding to glucose spikes. These applications have been largely out of reach for principled mixed-integer MPC precisely because solving MINLPs in real time is intractable.

Previous workarounds have involved approximations that often sacrifice accuracy (rounding relaxed solutions), architectures that sacrifice interpretability (deep neural networks approximating the full policy), or methods that require extensive interaction with the real system (reinforcement learning). Each carries a cost. Rounding can produce infeasible or suboptimal actions. Neural network policy approximation requires enormous offline datasets covering the full state space — and any update to the controller demands retraining. Reinforcement learning requires trial-and-error on the real system, which is unacceptable when mistakes are expensive or dangerous.

The myopic MINMPC framework avoids all three pitfalls. The learned object is not a full policy but a value function — a single matrix $P$ in the quadratic case. It requires only a modest number of expert demonstrations ($M = 360$ sufficed for the fishing problem). And because the online controller is still an explicit optimization problem with constraints, it inherits the interpretability and feasibility guarantees of MPC. Constraints are satisfied by construction; the controller can be inspected and reasoned about.

The conceptual move of relaxing integers during learning but reinstating them online is particularly clean. It avoids the theoretical dead-end of applying KKT conditions to discrete problems, while preserving the practical integrity of the controller. As the authors note, this is consistent with how nonlinear MPC is routinely solved in practice anyway — numerical solvers find KKT-satisfying stationary points, which are necessary but not sufficient for global optimality. The framework just extends this pragmatic posture to the learning phase.

Myopic MINMPC: Multi-Dimensional Performance Profile

Qualitative assessment of the myopic MINMPC framework across key engineering dimensions relative to baseline approaches, based on findings reported in the paper.

Myopic MINMPC: Multi-Dimensional Performance Profile
Label	Value
Real-Time Feasibility	9
Closed-Loop Performance	8
Data Efficiency	8
Interpretability	9
Constraint Satisfaction	9
Offline Cost	6

There is also a subtle result embedded in the paper's analysis: the learned policy is approximately "policy-consistent" with the expert. This is a formal notion — as the number of demonstrations $M \to \infty$ , the probability that the learned policy deviates from the expert's policy by more than $ϵ$ goes to zero. In finite-data practice, the near-zero KKT residuals observed in both benchmarks serve as practical evidence of this consistency.

What's Next

The paper is honest about its limitations, and they point toward a rich research agenda.

The quadratic form for $\hat{V} (x, θ) = x^{T} P x$ is deliberately simple. For the benchmark problems tested here — with states and inputs of order $O (1)$ — it works well. But real systems are messier. Highly nonlinear dynamics, non-quadratic cost structures, or state spaces spanning multiple orders of magnitude may demand richer function approximators: neural networks, Gaussian processes, or polynomial expansions. The authors acknowledge this, noting that more general choices of $\hat{V}$ may improve performance at the cost of larger datasets and potentially non-convex learning problems.

Stability is a related open question. The paper demonstrates empirically that the myopic controller tracks references under plant-model mismatch, but formal closed-loop stability guarantees are not established. Related work (Baltussen et al., 2025) suggests that enforcing descent conditions during learning can help; integrating such conditions with the KKT-residual framework is a natural next step.

The offline data generation cost — over two hours for 360 demonstrations in the fishing problem — is still substantial, even if it is a one-time expense. For systems where offline MINLP solving is feasible but online solving is not, this is an entirely reasonable trade. But for problems with even longer horizons or higher-dimensional state spaces, the offline cost could become prohibitive. The authors' insight that continuous relaxations can replace exact MINLP solutions for data generation partially addresses this, but further work is needed to quantify how much approximation quality degrades when relaxed demonstrations replace exact ones.

Finally, the current framework is demonstrated on problems with a small number of integer variables. The satellite problem has two ternary actuators; the fishing problem has one binary. Real industrial systems — power grids, logistics networks, chemical plants with many switching units — could have dozens or hundreds of integer decisions. Whether the KKT residual minimization approach scales gracefully to high-dimensional integer spaces remains an open and important question.

What the paper establishes, nonetheless, is a principled and practically viable pathway into a problem space that has long been considered largely intractable. By teaching a controller what the future is worth — from a few hundred carefully chosen examples — and then letting it think just one step ahead, Orrico, Heemels, and Krishnamoorthy have shown that the gap between theoretical optimality and real-time feasibility can be bridged with surprisingly modest machinery. The implication reaches beyond satellites and fish populations: anywhere discrete actuators and tight timing requirements coexist, myopic MINMPC offers a route from the textbook to the real world.

Teaching Robots to Think Shorter: How AI-Learned Shortcuts Could Unlock Real-Time Control of Complex Systems

The Science

What They Found

Why This Changes Things

What's Next

Source articles

Comments (0)