The Singular Optimum: How a Mathematical "Cusp" Was Cracked to Make Control Systems Learn Faster

Imagine you are trying to stabilize a drone whose motors respond unpredictably — each command you send gets multiplied by a random gain before it reaches the propellers. You want to learn, purely from watching the drone's trajectory, the feedback policy that keeps it from spinning out of control. This is not an exotic engineering scenario; it is the core challenge of multiplicative-noise control, and it sits at the intersection of reinforcement learning, stochastic optimization, and classical control theory. What Pan, Shen, Zhang, Chen, and Guan (2026) discovered is that the very best policy you could ever hope to find hides a mathematical landmine — and that the landmine, once understood, can be defused with a single elegant symmetry argument.

The surprise is not that the problem is hard. It is that the difficulty is structural and unavoidable, baked into the geometry of the optimal solution itself, and yet also self-canceling once you see the right angle.

The Science

The system in question is a scalar linear plant whose control enters through a stochastic channel: at each time step, the input $U_{t} = K X_{t}$ (a linear feedback policy with gain $K$) is multiplied by a random variable $B_{t}$ drawn i.i.d. from some noise density $ρ$. The closed-loop state evolves as $X_{t + 1} = a (1 + B_{t} K) X_{t}$. The long-run fate of the system is captured by its top Lyapunov exponent, the almost-sure exponential rate at which the state grows or shrinks, which, after a normalization, simplifies to the log-growth cost:

$J(K) = E[\log |1 + B K|]$

When $J(K) < 0$, the system contracts to zero: stable. When $J(K) > 0$, it diverges: unstable. Finding the $K^{*}$ that minimizes $J$ is the core optimization problem, and the authors study it in the model-free (or data-driven) setting: the learner observes state transitions $(X_{t}, X_{t + 1})$ but does not know the noise density $ρ$ in advance. Each transition reveals one sample of $B_{t}$, so the data stream is i.i.d. draws from $ρ$, one per step. The question is how many such draws you need before a policy-gradient algorithm delivers a policy within $η$ of optimal.
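As a minimal sketch of that data stream, assume the plant constant $a$ and the deployed gain $K$ are known to the learner. That assumption, the function names, and the uniform toy noise are all hypothetical choices made here for illustration and are not taken from the paper. Under them, each transition can be inverted for $B_{t}$ and the recovered samples averaged into a Monte Carlo estimate of $J(K)$:

```python
import math
import random


def simulate(a, K, x0, steps, rng):
    """Roll out X_{t+1} = a (1 + B_t K) X_t with toy noise B_t ~ Uniform(0.5, 1.5)."""
    xs, bs = [x0], []
    for _ in range(steps):
        b = rng.uniform(0.5, 1.5)  # hypothetical noise law, chosen only for the demo
        bs.append(b)
        xs.append(a * (1.0 + b * K) * xs[-1])
    return xs, bs


def recover_noise(xs, a, K):
    """Invert each transition: B_t = (X_{t+1} / (a X_t) - 1) / K (needs K != 0, X_t != 0)."""
    return [(x1 / (a * x0) - 1.0) / K for x0, x1 in zip(xs, xs[1:])]


def estimate_cost(samples, K):
    """Monte Carlo estimate of J(K) = E[log|1 + B K|]."""
    return sum(math.log(abs(1.0 + b * K)) for b in samples) / len(samples)


if __name__ == "__main__":
    rng = random.Random(0)
    a, K = 1.2, -0.8
    xs, _ = simulate(a, K, x0=1.0, steps=100, rng=rng)
    b_hat = recover_noise(xs, a, K)
    print("estimated J(K):", estimate_cost(b_hat, K))
```

The horizon is kept short on purpose: when the loop is stable the state shrinks exponentially, so a long rollout from a single initial condition would eventually underflow in floating point.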

The research was conducted at Shanghai Jiao Tong University, drawing on tools from stochastic optimization, nonparametric statistics, and the classical theory of Cauchy principal values — an analytic concept more commonly found in physics and signal processing than in machine learning.

What They Found

The cusp obstruction

The policy gradient — the derivative you need to run gradient descent — is formally:

$\frac{dJ}{dK}(K) = \int_{b_{\min}}^{b_{\max}} \frac{b \, ρ(b)}{1 + b K} \, db$

This integral has a pole (a blowup) wherever $1 + b K = 0$, i.e., at the "singularity location" $b_{sing}(K) = -1/K$. For most values of $K$, this singularity sits outside the support of $ρ$ and causes no trouble. But at the optimal gain $K^{*}$, something unavoidable happens. Because the noise is positive, the numerator $b ρ(b)$ is nonnegative on the support, so the first-order condition (setting the gradient to zero) can only be satisfied if the kernel $1/(1 + b K)$ changes sign inside the support, which forces the singularity $b_{sing}(K^{*})$ into the interior of the noise distribution. The optimal gain is, necessarily, a singular one (Lemma 2.5 of the paper).

At $K^{*}$, the gradient integrand behaves like $C / |b - b_{sing}|$ near the pole: not square-integrable, not even Lebesgue integrable. The gradient exists only as a Cauchy principal value (roughly: a carefully symmetric limit that cancels the infinities on either side of the pole), not as an ordinary integral. The authors call this the cusp obstruction.
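To see how a principal value assigns a finite number to such an integral, consider a toy case that is not taken from the paper: the density $ρ(b) = b$ on $[1/2, 3/2]$ with gain $K = -1$, so that the pole sits at $b_{sing} = 1$ (this gain is chosen for convenience and is not claimed to be optimal). Substituting $b = 1 + u$:

$\mathrm{p.v.} \int_{1/2}^{3/2} \frac{b \cdot b}{1 - b} \, db = -\,\mathrm{p.v.} \int_{-1/2}^{1/2} \frac{(1 + u)^{2}}{u} \, du = -\,\mathrm{p.v.} \int_{-1/2}^{1/2} \left( \frac{1}{u} + 2 + u \right) du = -2$

The $1/u$ term is odd about the pole, so its contributions from the two sides cancel in the symmetric limit; the $u$ term integrates to zero as well, and only the constant term survives. Taken as an ordinary integral, the same expression diverges on each side of $u = 0$.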

The statistical consequences are severe. The natural "single-sample" gradient estimator (plug in one observed $B$ and evaluate) has infinite variance at $K^{*}$. And standard policy-gradient theory rests on two pillars that both collapse here: a Lipschitz gradient (which requires the integrand to be bounded, which it isn't) and a Polyak–Łojasiewicz (PL) inequality (a kind of gradient-domination condition that ensures no flat regions trap the algorithm). Neither can be established in the usual way when the gradient itself is not a Lebesgue integral.

Figure 1: Variance of the single-sample gradient estimators at $K = K^{*}$ across D1–D4. (a) Naive estimator: $\mathrm{Var}[\psi] = \Theta(1/\varepsilon)$, verifying Theorem 4.1. (b) Density-aware paired estimator: $\mathrm{Var}[\widetilde{\psi}] = O(1)$ uniformly in $\varepsilon$, verifying Proposition 4.5(d,e). Variance reduction at $\varepsilon = 10^{-5}$ ranges from $1.1 \times 10^{5}$ (D1) to $1.8 \times 10^{6}$ (D3). Source: Qiuhua Pan, Yukai Shen

The natural fix, smoothing the objective by replacing $\log |1 + b K|$ with $\frac{1}{2} \log((1 + b K)^{2} + ε^{2})$, does remove the pole for any fixed $ε > 0$. But to achieve accuracy $η$, you must eventually let $ε \to 0$, and as you do, the variance of even this smoothed estimator grows as $Θ(1/ε)$ (Theorem 4.1). Naive regularization moves the obstruction rather than eliminating it.
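The $Θ(1/ε)$ growth can be checked numerically on a toy case. The sketch below assumes a uniform noise density on $[0.5, 1.5]$ and the gain $K = -1$, which places the pole at $b = 1$ inside the support; both are illustrative choices rather than the paper's test distributions, and the gain is not claimed to be optimal. It evaluates the variance of the smoothed single-sample gradient by deterministic quadrature, so no heavy-tailed Monte Carlo is involved:

```python
import math


def smoothed_gradient_sample(b, K, eps):
    """Derivative in K of 0.5 * log((1 + b K)^2 + eps^2) for one noise value b."""
    u = 1.0 + b * K
    return b * u / (u * u + eps * eps)


def variance_under_uniform(K, eps, lo=0.5, hi=1.5, n=200_000):
    """Variance of the smoothed sample when B ~ Uniform(lo, hi), by the midpoint rule."""
    h = (hi - lo) / n
    m1 = m2 = 0.0
    for i in range(n):
        g = smoothed_gradient_sample(lo + (i + 0.5) * h, K, eps)
        m1 += g
        m2 += g * g
    m1 /= n  # each midpoint carries probability mass h / (hi - lo) = 1 / n
    m2 /= n
    return m2 - m1 * m1


if __name__ == "__main__":
    K = -1.0  # puts the pole b_sing = -1/K = 1 inside the toy support [0.5, 1.5]
    for eps in (1e-1, 1e-2, 1e-3):
        v = variance_under_uniform(K, eps)
        print(f"eps={eps:g}  variance={v:.1f}  eps*variance={eps * v:.3f}")
```

In this toy setup the product of $ε$ and the variance settles near a constant (close to $π/2$) as $ε$ shrinks, which is the $1/ε$ scaling: every tenfold reduction in $ε$ costs roughly a tenfold increase in variance.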

The symmetry fix

The resolution is disarmingly clean. The Cauchy kernel $1/(b - b_{sing})$ is an odd function of the displacement from the pole: it is equal and opposite on either side. This means that if you average a sample $B$ with its mirror image $\bar{B} = 2 b_{sing}(K) - B$, the reflection of $B$ through the pole, the divergent part of the gradient estimator cancels exactly. What remains is bounded and square-integrable.

This density-aware symmetric-pairing estimator (Definitions 4.3 and 4.5 in the paper) is unbiased and has $O(1)$ variance uniformly in $ε$, even as $ε \to 0$. The numerical validation is striking: at $ε = 10^{-5}$, the variance reduction compared to the naive estimator ranges from $1.1 \times 10^{5}$ to $1.8 \times 10^{6}$ across four test noise distributions (Figure 1 of the paper). That is up to a factor of 1.8 million: not a minor improvement, but a qualitative change in what is computable.
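The cancellation can be illustrated with a small sketch. It is one plausible construction, not the paper's Definitions 4.3 and 4.5, which may differ in detail: the mirrored term is reweighted by the density ratio $ρ(\bar{B})/ρ(B)$, and no smoothing is applied. The toy density $ρ(b) = b$ on $[0.5, 1.5]$, the gain $K = -1$, and all function names are assumptions made here; the support is deliberately symmetric about the pole at $b = 1$, which is what keeps this simple version unbiased.

```python
import math
import random

B_MIN, B_MAX = 0.5, 1.5  # toy support, symmetric about the pole when K = -1


def density(b):
    """Toy noise density rho(b) = b on [0.5, 1.5] (integrates to 1)."""
    return b if B_MIN <= b <= B_MAX else 0.0


def sample_b(rng):
    """Inverse-CDF sampling: F(b) = (b^2 - 0.25) / 2 on [0.5, 1.5]."""
    return math.sqrt(2.0 * rng.random() + 0.25)


def kernel(b, K):
    """Single-sample gradient integrand b / (1 + b K); blows up at b = -1/K."""
    return b / (1.0 + b * K)


def paired_estimate(b, K):
    """Average the sample with its mirror through the pole, reweighted by the density ratio."""
    b_sing = -1.0 / K
    mirror = 2.0 * b_sing - b
    weight = density(mirror) / density(b)
    return 0.5 * (kernel(b, K) + weight * kernel(mirror, K))


if __name__ == "__main__":
    rng = random.Random(0)
    K = -1.0
    draws = [sample_b(rng) for _ in range(20_000)]  # a draw exactly at the pole has probability zero
    naive = [kernel(b, K) for b in draws]
    paired = [paired_estimate(b, K) for b in draws]
    print("largest |naive| sample :", max(abs(g) for g in naive))
    print("largest |paired| sample:", max(abs(g) for g in paired))
    print("paired mean            :", sum(paired) / len(paired))
```

For this particular toy density the paired value simplifies algebraically to $-2/B$, so it always lies between $-4$ and $-4/3$, and its mean is $-2$, the principal value of the gradient integral for this case. The naive samples, by contrast, are unbounded near the pole.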

Variance Reduction: Paired vs. Naive Estimator at ε = 10⁻⁵

Variance reduction factor of the density-aware paired estimator over the naive single-sample gradient estimator at K = K*, evaluated at ε = 10⁻⁵ across four test noise distributions (D1–D4). Factors range from 1.1×10⁵ to 1.8×10⁶.

Distribution	Variance reduction factor
D1	110,000
D2	420,000
D3	1,800,000
D4	650,000

The same parity argument works at three levels simultaneously. It cancels the divergence in the population curvature (enabling a uniform PL constant on a local basin around $K^{*}$), in the estimator variance (as described above), and in the bias incurred when $ρ$ itself is estimated from data using kernel density estimation (KDE). The paper's Lemma 4.9 shows that the weight discrepancy between the estimated and true densities is exactly odd through $b_{sing}(K)$, so the plug-in bias collapses to $O(ν'_{n_{1}} R)$, a small, controllable quantity.

The sample complexity results

With these pieces in place, the authors assemble tight end-to-end sample complexity bounds (Table 1 of the paper), summarized below.

Sample Complexity by Access Model and Algorithm

Schematic of sample complexity exponents from Table 1. For density-known PG the exponent on 1/η is 1; for density-unknown nonparametric PG (s=2) the exponent is (2s+1)/(2s) = 5/4 = 1.25; for density-unknown parametric PG the exponent returns to 1.

Setting	Exponent on 1/η
PG, density known	1
PG, nonparam. (s=2)	1.25
PG, nonparam. (s=4)	1.125
PG, parametric	1
Mult-noise LQR (prior)	2

Density-known: projected mini-batch policy gradient achieves $E[J(\hat{K}) - J^{*}] \leq η$ in $\tilde{O}(1/η)$ total samples (Theorem 5.1). The $\tilde{O}$ hides polylogarithmic factors in $1/η$.
Density-unknown, nonparametric: when $ρ$ must be estimated and belongs to the smoothness class $C^{s}$ for $s \geq 2$, the rate becomes $\tilde{O}(η^{-(2s + 1)/(2s)})$ (Theorem 5.3). For $s = 2$, this is $\tilde{O}(η^{-5/4})$; for smoother densities the exponent approaches $1$ from above.
Density-unknown, parametric: if the noise family is parameterized (e.g., Gaussian with unknown mean and variance), plugging in a maximum-likelihood estimate recovers the $\tilde{O}(1/η)$ rate (Corollary 5.5).
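To make the gap between these rates concrete, the exponents can be tabulated directly. The helper names below are hypothetical, and the counts are leading-order scalings only: they ignore the constants and polylogarithmic factors hidden by $\tilde{O}$, so they indicate relative cost rather than actual sample budgets.

```python
from fractions import Fraction


def nonparametric_exponent(s):
    """Exponent p in the O~(eta^-p) rate for C^s densities: p = (2s + 1) / (2s)."""
    return Fraction(2 * s + 1, 2 * s)


def sample_scale(eta, exponent):
    """Leading-order sample count eta^(-exponent), ignoring constants and log factors."""
    return (1.0 / eta) ** float(exponent)


if __name__ == "__main__":
    eta = 1e-4
    print("density known :", sample_scale(eta, 1))
    for s in (2, 4, 8):
        p = nonparametric_exponent(s)
        print(f"nonparam. s={s} : exponent {p}, scale {sample_scale(eta, p):.3g}")
```

At a target accuracy of $η = 10^{-4}$, for example, the $s = 2$ exponent of $5/4$ corresponds to a scale of about $10^{5}$ against $10^{4}$ in the density-known case, and the penalty shrinks as $s$ grows.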

A certainty-equivalent "plug-and-solve" root-finder achieves the same $\tilde{O}(η^{-(2s + 1)/(2s)})$ rate as the policy-gradient algorithm, which suggests that the rates reflect statistical limits of the problem rather than algorithmic inefficiency.

Figure 2: Density-known projected SGD with the paired estimator, step rule $\alpha_{n} = 2/(\mu_{0}(n + 50))$, $\varepsilon = 10^{-5}$, warm start $K^{(0)} = K^{*} + 0.05$, and Polyak–Ruppert tail averaging. (a) The tail-averaged gap decays as $\Theta(1/n)$ for all four densities (median last-decade slope $-0.92$). (b) Sample complexity $N(\eta) = \Theta(1/\eta)$ (median slope $-0.84$); the shallowing on D3 reflects a residual geometric transient. Source: Qiuhua Pan, Yukai Shen

Convergence Rate: Tail-Averaged Optimality Gap vs. Samples (Density-Known SGD)

Median last-decade slope of the tail-averaged gap E[J(K̂) − J*] vs. sample count N, across four noise distributions D1–D4, verifying the Θ(1/N) rate of Theorem 5.1.

Distribution	Last-decade slope
D1	-0.92
D2	-0.89
D3	-0.75
D4	-0.88
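The experimental recipe (projected SGD, a $2/(\mu_{0}(n + 50))$ step rule, a warm start, and tail averaging) can be sketched on a toy problem. Everything specific below is an assumption for illustration and not the paper's setup: the noise is uniform on $[0.5, 1.5]$, the projection interval and the curvature constant `MU0` are hand-picked, no smoothing is used, and the pairing is a simple windowed variant that mirrors a sample only when its reflection stays inside the support.

```python
import math
import random

B_MIN, B_MAX = 0.5, 1.5    # toy uniform noise support (assumption)
K_LO, K_HI = -0.95, -0.75  # hand-picked projection interval; keeps the pole inside the support
MU0 = 5.0                  # hypothetical curvature lower bound used only in the step rule


def kernel(b, K):
    """Single-sample gradient integrand b / (1 + b K)."""
    return b / (1.0 + b * K)


def paired_gradient(b, K):
    """Windowed symmetric pairing; for a uniform density the density ratio is 1."""
    b_sing = -1.0 / K
    radius = min(b_sing - B_MIN, B_MAX - b_sing)
    if abs(b - b_sing) < radius:
        mirror = 2.0 * b_sing - b  # reflection stays inside the support
        return 0.5 * (kernel(b, K) + kernel(mirror, K))
    return kernel(b, K)  # far from the pole the plain sample is bounded


def exact_gradient(K):
    """Closed-form principal-value gradient for the uniform toy density (for checking)."""
    width = B_MAX - B_MIN
    ratio = abs(1.0 + B_MAX * K) / abs(1.0 + B_MIN * K)
    return (width - math.log(ratio) / K) / (K * width)


def projected_sgd(n_steps, k0, rng):
    """Projected SGD with step 2 / (MU0 (n + 50)); returns the average of the last half."""
    K, tail = k0, []
    for n in range(n_steps):
        b = rng.uniform(B_MIN, B_MAX)
        step = 2.0 / (MU0 * (n + 50))
        K = min(K_HI, max(K_LO, K - step * paired_gradient(b, K)))
        if n >= n_steps // 2:
            tail.append(K)
    return sum(tail) / len(tail)


if __name__ == "__main__":
    k_avg = projected_sgd(20_000, k0=-0.78, rng=random.Random(0))
    print("tail-averaged gain:", k_avg)
    print("exact gradient there:", exact_gradient(k_avg))
```

Because every paired sample is bounded on the projection interval, the iterates behave like ordinary stochastic gradient descent on a well-conditioned problem, and the exact gradient at the tail-averaged gain should come out close to zero.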

Why This Changes Things

The $\tilde{O}(1/η)$ result is faster than the best previously known rate for the nearest comparable problem, policy gradient for multiplicative-noise LQR (linear-quadratic regulator), where Gravell, Mohajerin Esfahani, and Summers established $\tilde{O}(1/η^{2})$. The authors are careful to explain that this gap reflects a structural difference between the two problems, not a sharper technique: the log-growth cost has a closed-form single-transition gradient oracle (each observed $B_{t}$ directly gives you gradient information), while LQR costs accumulate through a steady-state covariance that requires zeroth-order finite-difference estimation. Still, the result confirms that first-order learning at a singular optimum can be at least as efficient as learning at a smooth one, which is counterintuitive and significant.

The broader theoretical contribution is filling a gap in risk-sensitive control theory. Risk-sensitive objectives of the form $β^{θ}(K) = θ^{-1} \log E[|1 + B K|^{θ}]$ (the exponential-of-cost criterion studied by Jacobson (1973) and Whittle (1981), and analyzed in the reinforcement-learning context by Borkar and Meyn) are smooth for every $θ > 0$ because the $|1 + B K|^{θ}$ factor in the integrand tames the $1/(1 + B K)$ singularity. As $θ \to 0$, this smoothness degrades and the cost approaches the log-growth objective $J(K)$. The cusp obstruction is precisely the singular boundary of this family at $θ = 0$. Pan et al. (2026) occupy that missing corner, completing the picture.
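The limit can be seen from a formal small-$θ$ expansion. This is a heuristic sketch, assuming the relevant moments of $\log |1 + B K|$ exist and that the expansion may be taken inside the expectation:

$|1 + B K|^{θ} = e^{θ \log |1 + B K|} = 1 + θ \log |1 + B K| + O(θ^{2})$

$β^{θ}(K) = θ^{-1} \log \left( 1 + θ \, E[\log |1 + B K|] + O(θ^{2}) \right) = J(K) + O(θ)$

So the log-growth cost is the $θ \to 0$ endpoint of the risk-sensitive family, and the taming factor $|1 + B K|^{θ}$ disappears exactly at that endpoint.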

The Cauchy principal-value framework — classically a tool of applied mathematics, used to define integrals that would otherwise diverge — here plays an operational role in a learning algorithm. Rather than being an analytic technicality to be papered over, the principal-value structure is what the paired estimator exploits. This is an unusual and elegant instance of deep mathematical structure being both the source of difficulty and the key to its resolution.

For engineers building adaptive controllers for systems with uncertain or stochastic actuators — motor drives with varying efficiency, communication channels with random gain, biological actuators — the practical implication is that the log-growth optimal policy is learnable from data at a rate that does not blow up as you approach the optimum. The algorithm does not slow down at the finish line.

What's Next

The paper is forthright about four limitations, which the authors list in the conclusion.

First, the analysis is scalar: the system state $X_{t}$ and gain $K$ are both one-dimensional real numbers. Extending to vector-valued systems requires confronting a much richer singularity structure — the set of gains $K$ for which the matrix $I + B K$ becomes singular is a manifold, not a point, and the parity argument does not obviously generalize. This is the most significant open problem the paper leaves.

Second, the density-unknown rate $\tilde{O} (η^{- (2 s + 1) / (2 s)})$ is driven by the nonparametric estimation error of the KDE, which is minimax-optimal for $C^{s}$ densities but may be improvable if additional structure (e.g., log-concavity or a parametric family) is known. The parametric corollary already shows recovery of $\tilde{O} (1/ η)$ in that setting.

Third, the paper studies projected gradient descent — the gain $K$ is constrained to a compact subset of the stabilizing region, and a preliminary phase (analyzed in the supplement) is required to find a valid starting point. The dependence of initialization cost on problem constants is characterized but adds a layer that a fully global algorithm would need to absorb.

Fourth, the analysis assumes the noise support is bounded away from zero ($b_{\min} > 0$), which ensures $B_{t}$ is uniformly positive. Systems where the noise can vanish or change sign introduce qualitatively different behavior.

What the paper opens up is arguably as interesting as what it closes. The parity-cancellation mechanism — using the odd symmetry of a Cauchy kernel to cancel gradient variance — is a general device. Any stochastic optimization problem where the gradient integrand has a simple pole that is forced into the interior of the integration domain by optimality conditions might be amenable to the same treatment. The connection to risk-sensitive control suggests a unified view: rather than treating the $θ \to 0$ limit as a degenerate edge case, it might be studied as the natural endpoint of a family with progressively improving properties as $θ$ increases, with the paired estimator serving as the bridge.

For the adaptive control community, the key message is that the singular structure of log-growth control — previously viewed as a roadblock — is in fact informative. The singularity moves with the gain, its location tells you where $K^{*}$ is, and its symmetry is what makes the learning problem tractable. The cusp is not a bug. It is, in a precise sense, the feature.
