Parameter reduction vs. SIMBa

5× — Schur (Proj.) needs nx² weights; SIMBa needs 5nx² for the same state matrix size

Synthetic systems benchmarked

50 — Randomly generated stable discrete-time systems with nx=5, ny=3, nu=3

Maximum training epochs

50,000 — AdamW optimizer with initial learning rate 1e-3; best validation weights retained

Candidate stable matrices per 2×2 block

≤15 — Finite closed-form candidate set searched during diagonal block projection in Algorithm 1

Projection speed-up

Orders of magnitude — Truncated Schur projection (JIT-compiled) vs. full Noferini-Poloni bi-level optimization

Pre-factorized parameter count

2nx² — Schur (Built) variant stores Z and T directly; intermediate between Proj. and SIMBa

Schur Decomposition Makes Neural Networks Stable

Imagine handing a neural network a few hundred measurements from a chemical plant and asking it to learn how the plant behaves — not just to fit the data, but to build a model you could hand to a control engineer and say: this will never blow up. That guarantee, called asymptotic stability, is the difference between a curiosity and a tool you'd trust with a real process. For decades it was considered nearly incompatible with gradient-based machine learning. A new paper by Vanegas, Lensu, and Ruiz (2026) changes that calculus, introducing a projection method that enforces stability at every training step using a surprisingly compact mathematical trick — and does it with up to five times fewer parameters than the current leading approach.

The Science

The problem lives at the intersection of control theory and sequence modelling. A state-space model, the workhorse of control engineering since the 1970s, represents a dynamical system as a pair of matrix equations: one that describes how an internal "state" vector $x[k]$ evolves over time, and one that maps that state to observable outputs $y[k]$:

$x[k+1] = A x[k] + B u[k], \qquad y[k] = C x[k] + D u[k]$

The matrix $A$, the state matrix, is the heart of the model. Its eigenvalues (the roots of the characteristic polynomial of $A$, which encode the system's natural frequencies and decay rates) determine whether the model is stable. Schur stability, the discrete-time notion, is used here in the sense that all eigenvalues satisfy $|\lambda_i| \leq 1$; geometrically, every eigenvalue must lie inside or on the unit circle in the complex plane. (Eigenvalues strictly inside the circle give the asymptotic stability mentioned above; those exactly on it sit at the boundary.) Violate the bound, and the model's free response grows exponentially, diverging into nonsense.
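The link between eigenvalue magnitude and divergence can be checked numerically. The sketch below is illustrative only: the two state matrices (a planar rotation scaled by 0.9 and by 1.1) and the helper names `spectral_radius` and `free_response` are assumptions chosen for the example, not taken from the paper.

```python
import numpy as np

def spectral_radius(A):
    """Largest eigenvalue magnitude of a square matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(A))))

def free_response(A, x0, steps):
    """Norms of x[k] for the unforced system x[k+1] = A x[k]."""
    x = np.asarray(x0, dtype=float)
    norms = []
    for _ in range(steps):
        norms.append(float(np.linalg.norm(x)))
        x = A @ x
    return norms

# Hypothetical state matrices: a rotation scaled by 0.9 (eigenvalues inside
# the unit circle) and by 1.1 (eigenvalues outside it).
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
A_stable, A_unstable = 0.9 * R, 1.1 * R

rho_stable = spectral_radius(A_stable)
rho_unstable = spectral_radius(A_unstable)
decay = free_response(A_stable, [1.0, 0.0], 50)
growth = free_response(A_unstable, [1.0, 0.0], 50)

print(f"rho = {rho_stable:.2f}: |x[49]| = {decay[-1]:.3e}")
print(f"rho = {rho_unstable:.2f}: |x[49]| = {growth[-1]:.3e}")
```

With the same initial state, the first trajectory shrinks toward zero while the second grows without bound, which is the behaviour the stability constraint is meant to rule out.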

When you train a neural network, gradient descent has no particular reason to respect this geometric constraint. It will happily push eigenvalues outside the unit circle if that reduces the loss on the training batch. Prior solutions fell into three camps, each with a significant drawback. Regularization methods add a penalty term to the loss function, but require careful tuning of a penalty strength $\rho_r$: too small and you get instability, too large and you shrink the state matrix to zero. Subspace identification methods (SIMs) give you stability for free if the source system is stable, but they aren't compatible with backpropagation and break down for nonlinear architectures. SIMBa (Di Natale et al., 2023), the current state of the art, uses Linear Matrix Inequalities (LMIs), a powerful algebraic tool for encoding convex constraints, to build a parameterization that is always Schur stable. The catch: it needs $5 n_x^2$ weights to represent an $n_x \times n_x$ state matrix (which has only $n_x^2$ entries), introducing what the authors call a "gradient fan-out effect" that can slow convergence.

The new approach from Vanegas et al. (2026) takes a different route, grounded in a classical piece of linear algebra called the real Schur decomposition. Any real square matrix $A$ can be written as $A = Z T Z^{\top}$, where $Z$ is an orthogonal matrix (a rigid rotation or reflection, no stretching) and $T$ is quasi-upper-triangular: zero below the diagonal except where $2 \times 2$ blocks on the diagonal encode complex-conjugate eigenvalue pairs. The eigenvalues of $A$ are exactly the eigenvalues of those diagonal blocks of $T$. That localization is the key. To enforce Schur stability, you only need to project the offending small blocks so that their eigenvalues land inside or on the unit circle, a fast, closed-form operation, leaving the rest of the structure intact.
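The decomposition itself is available in standard numerical libraries. As a minimal sketch, assuming SciPy's `scipy.linalg.schur` (the paper's own code ships a custom GPU-compatible routine instead), the snippet below factors a random matrix and confirms that the eigenvalues can be read off the diagonal blocks alone. The helper `block_eigenvalues` is a hypothetical name introduced for this illustration.

```python
import numpy as np
from scipy.linalg import schur

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))

# Real Schur form: A = Z @ T @ Z.T with Z orthogonal, T quasi-upper-triangular.
T, Z = schur(A, output="real")

reconstruction_error = float(np.linalg.norm(Z @ T @ Z.T - A))
orthogonality_error = float(np.linalg.norm(Z.T @ Z - np.eye(5)))
# Everything below the first subdiagonal of T should vanish.
below_subdiagonal = float(np.linalg.norm(np.tril(T, k=-2)))

def block_eigenvalues(T, tol=1e-12):
    """Eigenvalues gathered from the 1x1 and 2x2 diagonal blocks of T."""
    n = T.shape[0]
    eigs = []
    i = 0
    while i < n:
        # A nonzero subdiagonal entry marks a 2x2 block.
        if i + 1 < n and abs(T[i + 1, i]) > tol:
            eigs.extend(np.linalg.eigvals(T[i:i + 2, i:i + 2]))
            i += 2
        else:
            eigs.append(T[i, i])
            i += 1
    return np.array(eigs, dtype=complex)

eigs_from_blocks = block_eigenvalues(T)
eigs_of_A = np.linalg.eigvals(A)
print(np.sort(np.abs(eigs_from_blocks)))
print(np.sort(np.abs(eigs_of_A)))
```

The two printed arrays of eigenvalue magnitudes agree up to rounding, so any stability check or correction only has to look at blocks of size at most two.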

The researchers built this into a weight-projection algorithm (Algorithm 1 in the paper) that runs at the end of every training batch. After a gradient step updates $A$, the algorithm computes the Schur decomposition, scans the diagonal blocks, and snaps any that encode eigenvalues outside the unit circle to their nearest stable peer. For a $1 \times 1$ block (a real eigenvalue), the fix is trivial: divide by $\max\{1, |\lambda|\}$. For a $2 \times 2$ block (a complex-conjugate pair), the algorithm searches a finite candidate set of at most 15 stable matrices and picks the closest one by Frobenius norm. Then it reconstructs $\hat{A} = Z \hat{T} Z^{\top}$ and continues training. The whole thing is JIT-compiled for speed.
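The overall shape of that projection step can be sketched in a few lines. This is a simplified stand-in under stated assumptions, not Algorithm 1: the $1 \times 1$ rule follows the text, but the $2 \times 2$ case here just rescales the block by its spectral radius instead of searching the paper's candidate set, so it yields a stable block that is not necessarily the nearest one. The function name `project_quasi_triangular` and the example factors are hypothetical.

```python
import numpy as np

def project_quasi_triangular(T, tol=1e-12):
    """Shrink unstable diagonal blocks of a real Schur factor T.

    1x1 blocks: divide by max{1, |lambda|}, as described in the text.
    2x2 blocks: scale by 1 / max{1, rho(block)}. This is a simplification,
    NOT the paper's search over at most 15 candidate matrices.
    """
    T_hat = np.array(T, dtype=float, copy=True)
    n = T_hat.shape[0]
    i = 0
    while i < n:
        if i + 1 < n and abs(T_hat[i + 1, i]) > tol:
            block = T_hat[i:i + 2, i:i + 2]
            rho = float(np.max(np.abs(np.linalg.eigvals(block))))
            T_hat[i:i + 2, i:i + 2] = block / max(1.0, rho)
            i += 2
        else:
            T_hat[i, i] /= max(1.0, abs(T_hat[i, i]))
            i += 1
    return T_hat

# Hypothetical factors: an unstable real eigenvalue (1.5), an unstable
# complex pair 1.2 +/- 0.5i (modulus 1.3), and a stable eigenvalue (0.4).
T = np.array([[1.5,  0.3, 0.2, 0.1],
              [0.0,  1.2, 0.5, 0.7],
              [0.0, -0.5, 1.2, 0.2],
              [0.0,  0.0, 0.0, 0.4]])
rng = np.random.default_rng(0)
Z, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # random orthogonal factor

A = Z @ T @ Z.T
T_hat = project_quasi_triangular(T)
A_hat = Z @ T_hat @ Z.T  # reconstruct the projected state matrix

rho_before = float(np.max(np.abs(np.linalg.eigvals(A))))
rho_after = float(np.max(np.abs(np.linalg.eigvals(A_hat))))
print(rho_before, rho_after)
```

In a training loop, `T` and `Z` would come from a real Schur decomposition of the freshly updated $A$ rather than being written by hand. Entries of `T` outside the diagonal blocks are left untouched, which is what keeps the rest of the matrix structure intact.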

They also propose a pre-factorized variant ("Schur Built"): instead of decomposing $A$ at each step, store $Z$ and $T$ directly as trainable parameters, keeping $Z$ orthogonal via singular-value decomposition (SVD) and stabilizing $T$ with the same algorithm. This avoids recomputing the Schur factorization every batch, trading a doubled state-matrix parameter count ($2n_x^2$ instead of $n_x^2$) for faster forward passes.
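The text does not spell out how the SVD is used to keep $Z$ orthogonal. One common choice, assumed here purely for illustration, is to replace a drifted $Z$ by its orthogonal polar factor $U V^{\top}$, which is the orthogonal matrix closest to it in Frobenius norm. The helper name `nearest_orthogonal` is hypothetical.

```python
import numpy as np

def nearest_orthogonal(M):
    """Orthogonal polar factor U @ Vt of M, computed from its SVD."""
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt

# Hypothetical trainable factor that has drifted away from orthogonality
# after a gradient step.
rng = np.random.default_rng(1)
Z_raw = np.eye(4) + 0.1 * rng.standard_normal((4, 4))

drift_before = float(np.linalg.norm(Z_raw.T @ Z_raw - np.eye(4)))
Z = nearest_orthogonal(Z_raw)
drift_after = float(np.linalg.norm(Z.T @ Z - np.eye(4)))
print(drift_before, drift_after)
```

Because an orthogonal similarity transform preserves eigenvalues, restoring $Z$ this way means the spectrum of $Z T Z^{\top}$ is still governed by the diagonal blocks of $T$ alone.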

The theoretical foundation rests on a 2023 result by Noferini and Poloni — cited extensively throughout the paper — who proved that projecting the quasi-triangular Schur factor to its nearest stable peer is equivalent to one iteration of the globally-convergent bi-level optimization for the nearest $\Omega$-stable matrix problem. The full bi-level algorithm is provably optimal but far too slow for training (orders of magnitude slower, as Table 1 in the paper shows). The truncated version is fast enough to use every batch, and for the purposes of sequence modelling — where eigenvalue approximation matters more than Frobenius-norm minimality — it actually performs better on the key spectral metric (NSSR, normalized squared spectral radius).

What They Found

The experiments split into two phases: a synthetic benchmark replicating the setup from Di Natale et al. (2023), and a real-world nonlinear architecture test.

Parameter Count: Schur vs. SIMBa for a State Matrix of Size nx × nx

Number of trainable weights required to parameterize a Schur-stable state matrix under each method, as a function of state dimension nx. Schur (Proj.) uses nx², Schur (Built) uses 2nx², and SIMBa uses 5nx². The values listed below are the nx² counts for Schur (Proj.).

State dimension	Schur (Proj.) weights (nx²)
nx=5	25
nx=10	100
nx=20	400
nx=50	2,500
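The counts for the other two methods follow directly from the formulas in the caption. A small sketch that tabulates all three, using a hypothetical helper `state_matrix_weights`:

```python
def state_matrix_weights(nx):
    """Trainable weights for an nx-by-nx state matrix, per the stated formulas."""
    return {
        "Schur (Proj.)": nx ** 2,
        "Schur (Built)": 2 * nx ** 2,
        "SIMBa": 5 * nx ** 2,
    }

counts = {nx: state_matrix_weights(nx) for nx in (5, 10, 20, 50)}
for nx, row in counts.items():
    print(f"nx={nx}: {row}")
```

The ratio between SIMBa and Schur (Proj.) is 5 at every state dimension, so the absolute gap grows quadratically with nx.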

For the synthetic benchmark, 50 randomly generated stable discrete-time systems were used, each with state dimension $n_{x} = 5$ , three inputs, and three outputs. Each system was given 300 training samples, 300 validation samples, and 300 test samples of input-output data, with inputs drawn from a Generalized Binary Noise distribution. Models were trained for up to 50,000 epochs using AdamW. The key result: both Schur-based methods matched SIMBa's accuracy and convergence rate on these linear systems, with only a marginal increase in wall-clock time per epoch — an acceptable cost given the parameter savings.

Figure, panel (a): n = 10. Source: Sergio Vanegas, Lasse Lensu

Figure, panel (b): n = 20. Source: Sergio Vanegas, Lasse Lensu

The matrix-approximation benchmarks (Table 1 in the paper, with figures across matrix sizes $n = 10, 20, 50, 100$) reveal something important about the tradeoff. For small matrices, the full Noferini-Poloni bi-level optimization achieves a lower Normalized Squared Frobenius Error (NSFE: the squared Frobenius-norm distance between the original unstable matrix and its stable projection, divided by the original matrix's squared Frobenius norm). But for larger matrices ($n \geq 50$), the bi-level method's execution-time limit forces early termination, and the truncated Schur projection wins on NSFE. More importantly, the truncated method consistently produces a better Normalized Squared Spectral Radius (NSSR: a measure of how well the projected eigenvalues match the original eigenvalues, with the pairing chosen optimally by solving an integer linear program). This matters because for system identification, preserving the dynamical character of the original system is more valuable than minimizing the Frobenius distance to an arbitrary stable matrix.
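The NSFE definition above translates directly into code. A minimal sketch with a hand-picked diagonal example (the function name `nsfe` and the matrices are illustrative assumptions; NSSR is left out because its exact formula is not given in this article):

```python
import numpy as np

def nsfe(A, A_hat):
    """Normalized squared Frobenius error: ||A - A_hat||_F^2 / ||A||_F^2."""
    num = np.linalg.norm(A - A_hat, "fro") ** 2
    den = np.linalg.norm(A, "fro") ** 2
    return float(num / den)

# Hypothetical example: the unstable eigenvalue 2.0 is clipped to 1.0.
A = np.diag([2.0, 0.5])
A_hat = np.diag([1.0, 0.5])

value = nsfe(A, A_hat)  # (2 - 1)^2 / (2^2 + 0.5^2) = 1 / 4.25
print(value)
```

A value of zero means the projection left the matrix unchanged, which is what happens when the input is already stable.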

Figure, panel (f): n = 10. Source: Sergio Vanegas, Lasse Lensu

Figure, panel (g): n = 20. Source: Sergio Vanegas, Lasse Lensu

The Mean Squared Violation Radius (MSVR) — a metric measuring how far any eigenvalues stray outside the unit circle after projection — is effectively zero for both projection methods, confirming that the stability guarantee holds numerically in practice.
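The article does not give the exact MSVR formula. One plausible reading, assumed here only for illustration, averages the squared amount by which each eigenvalue magnitude exceeds 1; the function name `msvr` and the test matrices are hypothetical.

```python
import numpy as np

def msvr(A):
    """ASSUMED definition: mean over eigenvalues of max(0, |lambda| - 1)^2."""
    excess = np.maximum(0.0, np.abs(np.linalg.eigvals(A)) - 1.0)
    return float(np.mean(excess ** 2))

# Hypothetical matrices before and after clipping the eigenvalue 1.5 to 1.0.
before = msvr(np.diag([1.5, 0.5]))  # (0.5^2 + 0) / 2
after = msvr(np.diag([1.0, 0.5]))

print(before, after)
```

Under this reading the metric is exactly zero whenever every eigenvalue lies inside or on the unit circle, so a near-zero value after projection is a direct numerical check of the stability guarantee.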

Projection Quality: Truncated Schur vs. Bi-level Optimization (n=10)

Comparison of the truncated Schur projection (Algorithm 1) vs. the full Noferini-Poloni bi-level optimization across three error metrics and execution speed for 10×10 matrices. For NSFE, NSSR, and MSVR a lower metric value is better; for speed, faster is better. Values are qualitative rankings, not measurements, derived from Table 1 of the paper (bi-level wins on NSFE for small matrices; truncated wins on NSSR; both achieve near-zero MSVR).

Metric	Qualitative score
NSFE (lower=better)	3
NSSR (lower=better)	5
MSVR (lower=better)	5
Speed (higher=faster)	5

On real-world datasets (using a Hammerstein-Wiener nonlinear architecture, a structure that sandwiches a linear state-space layer between two nonlinear "wrapper" functions), the lower parameter count of the Schur-based methods translated into a concrete convergence advantage. The pre-factorized "Schur Built" variant, with $2 n_x^2$ parameters, and the projected variant, with $n_x^2$ parameters, both converged more reliably than SIMBa's $5 n_x^2$ formulation. The regularization baseline was sensitive to the choice of $\rho_r$, occasionally producing null or unstable state matrices when the hyperparameter was poorly tuned.

Stability Methods Compared: Key Properties

Qualitative comparison of four stable state-space identification methods across four key properties, scored 1–5. Based on characterizations in Vanegas et al. (2026), Sections 2 and 3.

Method	Stability	Param. Efficiency	No Hyperparams
Schur Proj.	5	5	5
SIMBa	5	1	5
Regularized	2	5	(not listed)

Why This Changes Things

The significance here goes well beyond academic benchmarking. State-space neural networks are increasingly being deployed — or seriously considered for deployment — in real-time control applications: robotic manipulators, autonomous vehicles, power-grid frequency regulation, biomedical devices. In all of these domains, a model that can produce a diverging output trajectory isn't just inaccurate; it's dangerous. The existing toolkit for ensuring stability either requires expert hyperparameter tuning (regularization), is fundamentally incompatible with backpropagation (classical SIMs), or introduces so much parameter overhead that training becomes unreliable at the scales relevant to edge deployment.

The parameter efficiency story is particularly consequential. The difference between $n_x^2$ and $5 n_x^2$ weights might sound abstract, but consider a state-space layer with $n_x = 50$: SIMBa needs 12,500 weights to parameterize the state matrix alone; the Schur projection needs 2,500. At the scales typical of embedded control systems, where you might be running inference on a microcontroller, that difference can determine whether a model fits in memory at all. And even where memory isn't the constraint, over-parameterization can carry a cost during training: the "gradient fan-out effect" the authors describe spreads each update across more weights and can slow convergence. The Schur method's empirical convergence advantage on real-world nonlinear datasets is plausibly a direct consequence of this reduction.

There's also a conceptual clarity to the approach that rivals can't quite match. Regularization methods enforce stability softly — they discourage instability without preventing it. LMI-based methods like SIMBa enforce it structurally, but at the cost of a convoluted parameterization whose geometric meaning is opaque. The Schur method enforces stability by operating directly on the eigenvalues, the quantities that define stability, and doing so with a projection that has a clear geometric interpretation: snap each eigenvalue to the nearest point on or inside the unit circle, with minimal distortion to the overall matrix structure. That transparency is valuable not just philosophically but practically — it makes the method easier to debug, audit, and extend.

The paper also situates this work within the broader renaissance of state-space models in machine learning. Systems like S4 (Gu et al., 2022) and Mamba (Gu & Dao, 2023) have demonstrated that state-space layers can compete with Transformers for sequence modelling at scale, using the HiPPO matrix, a specially designed stable initialization, as their foundation. But those architectures are built for billion-parameter language models and DNA sequence modelling, not for the 5- to 50-state systems that dominate industrial control. The Schur projection fills a gap in the lower-dimensional regime, providing rigorous stability guarantees at the scales where control engineers actually live.

What's Next

The authors are candid about the method's current scope. The experiments focus on discrete-time linear state-space layers embedded within nonlinear architectures, a setting that covers a wide range of practical systems but doesn't yet extend to continuous-time formulations (used by S4 and Mamba) or to fully nonlinear state-transition functions. Extending the Schur projection to continuous-time Hurwitz stability, where the requirement shifts from $|\lambda_i| \leq 1$ to $\operatorname{Re}(\lambda_i) \leq 0$, is a natural next step, and the theoretical machinery from Noferini and Poloni (2023) is general enough to support it.

The synthetic benchmark is also limited to relatively low-dimensional systems ($n_x = 5$). Scaling to higher state dimensions while maintaining the computational advantage over SIMBa remains to be demonstrated, though the analysis in the paper is encouraging: the truncated projection's execution-time advantage over the bi-level optimization grows with matrix size, while the bi-level method's NSFE advantage shrinks.

A subtler open question concerns the interaction between the Schur projection and the nonlinear components of stacked architectures. In a Hammerstein-Wiener model, the stability of the linear core doesn't automatically guarantee stability of the full nonlinear system — that depends on properties of the nonlinear wrappers as well. The paper's framing is honest about this: the Schur method ensures the linear layer behaves, but a full stability certificate for the nonlinear stack would require additional analysis, perhaps drawing on Lyapunov-based tools from nonlinear control theory.

The code is fully open-source at codeberg.org/sergiovaneg/SchurSS, with a GPU-compatible custom Schur decomposition included to handle the fact that standard deep-learning frameworks don't natively expose a JIT-compilable Schur routine. That openness matters. The hardest part of deploying stability-constrained neural networks in industrial settings isn't usually the math — it's the gap between a published algorithm and a production-ready implementation. Closing that gap is how academic results become engineering tools.

The direction of travel is clear. As neural network-based control and system identification move from research labs into power plants, medical devices, and autonomous systems, the demand for models with certified behavioral properties — not just empirically good ones — will only grow. The Schur projection method offers a principled, efficient, and mathematically transparent path toward that certification. It won't be the last word, but it's a genuinely useful one.
