A Simple Calibration Trick Makes AI-Assisted Science Far More Reliable

Somewhere in a genomics lab, a researcher wants to know the average expression level of a gene across thousands of patients. Getting a gold-standard measurement for every patient is expensive and slow. But a machine-learning model, trained on historical data, can produce a predicted value for everyone — cheaply, instantly. The sensible thing is to use the model's predictions to extend the reach of the expensive measurements. This is prediction-powered inference: a statistical approach that blends a small set of labeled, verified observations with a large sea of AI-generated predictions.

The problem is that AI models lie about their units. A neural network might correctly identify that patient A is more likely to express this gene than patient B — its rankings are trustworthy — while its actual numerical outputs bear no meaningful relationship to the true scale of the outcome. When you naively feed those numbers into a statistical formula, the residuals (the gaps between the model's guess and reality) stay stubbornly large. The confidence interval stays stubbornly wide. You've done the expensive measurements. You've deployed the fancy model. And you're still not getting the precision you should.

The insight at the center of this paper is that fixing this problem is simpler than it sounds. Van der Laan & van der Laan (2026) propose Calibrated Prediction-Powered Inference: a framework that first corrects the prediction model's output scale using the labeled data you already have, then plugs that corrected score into the statistical estimator. No retraining. No new data. Just a post-processing step that takes maybe a few seconds to run and that, in the paper's motivating LLM-evaluation example, shortens confidence intervals by up to about 17%.

The Science

The setting is called semisupervised mean estimation. You have two datasets: a small labeled sample of $n$ observations where both covariates $X$ and outcomes $Y$ are known, and a large unlabeled sample of $N$ observations where only covariates are available. Your goal is to estimate the population mean $ψ_{0} = E [Y]$ — say, average exam performance, average gene expression, or average model accuracy on a benchmark — as precisely as possible.

The workhorse for this problem is the augmented inverse-probability weighted (AIPW) estimator, a technique from the 1990s missing-data literature (Robins et al., 1994). AIPW takes a prediction score $f (X)$ , averages it over both labeled and unlabeled samples to get a broad estimate, then uses the labeled residuals $Y - f (X)$ to correct any systematic error. Crucially, AIPW is safe: for any choice of $f$ , the estimator remains unbiased. A bad score inflates your uncertainty but never biases your answer.

$ψ (f) = ρ_{n} P_{n}^{L} \{f (X)\} + (1 - ρ_{n}) P_{N}^{U} \{f (\tilde{X})\} + P_{n}^{L} \{Y - f (X)\}$

Here $ρ_{n} = n / (n + N)$ is the fraction of observations that are labeled. Recent work in the machine-learning community branded a version of this framework prediction-powered inference (PPI) (Angelopoulos et al., 2023), bringing it to a wider audience with accessible software. The authors of this paper sit firmly in the older semiparametric tradition and spend some care clarifying that PPI and its variants are specific instances of AIPW — a genealogy that matters for understanding what's actually going on.
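As a rough illustration, the displayed estimator fits in a few lines of standard-library Python. This is a sketch, not the `ppi_aipw` package's implementation: the function name `aipw_mean`, its arguments, and the normal-approximation interval are assumptions introduced here, and the standard error treats $f$ as a fixed function and the two samples as independent.

```python
import math
from statistics import fmean, variance


def aipw_mean(y_lab, f_lab, f_unlab, z=1.96):
    """AIPW estimate of E[Y] with a normal-approximation interval.

    y_lab   : outcomes Y on the n labeled units
    f_lab   : scores f(X) on the same labeled units
    f_unlab : scores f(X~) on the N unlabeled units
    (Hypothetical helper for illustration; f is treated as fixed.)
    """
    n, N = len(y_lab), len(f_unlab)
    rho = n / (n + N)  # labeled fraction rho_n

    # The three terms of the displayed formula.
    plug_in = rho * fmean(f_lab) + (1 - rho) * fmean(f_unlab)
    correction = fmean([y - f for y, f in zip(y_lab, f_lab)])
    estimate = plug_in + correction

    # Regrouped, the estimator is mean_L[Y - (1 - rho) f] + (1 - rho) mean_U[f]:
    # two independent sample means, so their variances add.
    lab_part = [y - (1 - rho) * f for y, f in zip(y_lab, f_lab)]
    se = math.sqrt(variance(lab_part) / n
                   + (1 - rho) ** 2 * variance(f_unlab) / N)
    return estimate, (estimate - z * se, estimate + z * se)


# Tiny made-up example: 4 labeled units, 6 unlabeled units.
y_lab = [1, 0, 1, 1]
f_lab = [0.9, 0.2, 0.7, 0.6]
f_unlab = [0.8, 0.3, 0.5, 0.9, 0.4, 0.7]
estimate, (lo, hi) = aipw_mean(y_lab, f_lab, f_unlab)
print(f"estimate={estimate:.3f}, interval=({lo:.3f}, {hi:.3f})")
```

Adding a constant to every score leaves the estimate unchanged, because the constant enters the plug-in term and the residual correction with opposite signs. That is one way to see the "safety" property: the score can shift the variance, but not the target.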

The key theoretical point is that efficiency — how tight your confidence intervals are — depends entirely on how well $f (X)$ approximates the true conditional mean $μ_{0} (x) = E [Y ∣ X = x]$ . Not how well it ranks observations, but how close its actual numeric values are to the outcome values. A score trained to maximize ranking accuracy on a different dataset, using a different loss, at a different time, may rank beautifully and approximate terribly. The paper's intervention is to use the labeled data to calibrate $f$ — to bend its output so the numbers line up with the outcome scale — before feeding it into the estimator.
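One way to see why scale matters, sketched here under the simplifying assumptions that $f$ is a fixed function and that the labeled and unlabeled samples are independent draws from the same population (the symbol $σ^{2}$ is introduced only for this sketch): regroup the estimator into a labeled mean plus an unlabeled mean, then add the two variances.

$ψ (f) = P_{n}^{L} \{Y - (1 - ρ_{n}) f (X)\} + (1 - ρ_{n}) P_{N}^{U} \{f (\tilde{X})\}$

$\mathrm{Var} \{ψ (f)\} = \frac{1}{n} \mathrm{Var} \{Y - (1 - ρ_{n}) f (X)\} + \frac{(1 - ρ_{n})^{2}}{N} \mathrm{Var} \{f (X)\}$

When $N$ is much larger than $n$, $ρ_{n}$ is close to zero and the second term is negligible, leaving

$\mathrm{Var} \{ψ (f)\} \approx \frac{1}{n} [σ^{2} + \mathrm{Var} \{μ_{0} (X) - f (X)\}]$ , where $σ^{2} = E [\mathrm{Var} (Y ∣ X)]$

The noise term $σ^{2}$ cannot be removed, but the second term disappears only when $f$ matches $μ_{0}$ numerically, up to an additive constant that cancels in the estimator. A score that ranks perfectly but is ten times too large, $f = 10 μ_{0}$ , leaves $\mathrm{Var} \{μ_{0} - f\} = 81 \, \mathrm{Var} \{μ_{0}\}$ in the numerator.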

Two flavors of calibration are studied in detail. Linear calibration fits a simple affine transformation $\hat{a} \cdot f (X) + \hat{b}$ to the labeled data. The paper proves this is asymptotically equivalent to PPI++ (Angelopoulos et al., 2023), a "power-tuned" variant of PPI, a result that elegantly unifies two approaches that look quite different on the surface. Isotonic calibration fits a monotone step function to the data using isotonic regression, a nonparametric method that imposes only the constraint that the transformation be non-decreasing. This is more flexible and, as the theory shows, more powerful.
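Both calibrators are small enough to sketch in standard-library Python. The names `fit_linear` and `fit_isotonic` are illustrative and are not taken from the `ppi_aipw` package; the isotonic fit uses the standard pool-adjacent-violators algorithm, and the six data points are made up.

```python
from bisect import bisect_right
from itertools import groupby
from statistics import fmean


def fit_linear(scores, outcomes):
    """Least-squares affine map a * f + b fit on the labeled data."""
    f_bar, y_bar = fmean(scores), fmean(outcomes)
    sxx = sum((f - f_bar) ** 2 for f in scores)
    sxy = sum((f - f_bar) * (y - y_bar) for f, y in zip(scores, outcomes))
    a = sxy / sxx if sxx > 0 else 0.0
    b = y_bar - a * f_bar
    return lambda f: a * f + b


def fit_isotonic(scores, outcomes):
    """Non-decreasing step function fit by pool-adjacent-violators."""
    blocks = []  # each block: [sum of outcomes, count, smallest score]
    pairs = sorted(zip(scores, outcomes))
    for f, group in groupby(pairs, key=lambda p: p[0]):
        ys = [y for _, y in group]  # tied scores enter as one block
        blocks.append([sum(ys), len(ys), f])
        # Merge backwards while adjacent block means decrease.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, _ = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    starts = [b[2] for b in blocks]
    levels = [b[0] / b[1] for b in blocks]

    def predict(f):
        # Use the last block starting at or below f; scores below the
        # training range are clipped to the first block.
        return levels[max(bisect_right(starts, f) - 1, 0)]

    return predict


# Made-up raw scores on an arbitrary scale, with binary outcomes.
raw_scores = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
outcomes = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0]
linear = fit_linear(raw_scores, outcomes)
isotonic = fit_isotonic(raw_scores, outcomes)
print([round(linear(f), 3) for f in raw_scores])
print([round(isotonic(f), 3) for f in raw_scores])
```

The linear map can leave the outcome range (it may predict below 0 or above 1 for a binary outcome), while the isotonic fit always returns averages of observed outcomes. That is one practical reason the nonparametric option is attractive when the score's shape is unknown.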

The research was conducted by Lars van der Laan at the University of Washington and Mark van der Laan at UC Berkeley, two statisticians with deep roots in targeted learning and semiparametric efficiency theory. Their theoretical results are accompanied by simulations and real-data experiments, plus an open-source Python package (ppi_aipw) that anyone can install and use immediately.

What They Found

Figure 1: Toy example illustrating how calibration can improve prediction-powered mean estimation. The example is derived from the MATH slice of a public LLM-evaluation benchmark with binary human correctness labels, using the ArmoRM evaluator’s raw margin as the prediction score. The target is the average human correctness $ψ = E [Y]$, estimated from a small labeled sample together with many unlabeled evaluator scores. Left: the raw margin contains clear signal about correctness, but it is not on the correct numerical scale. Middle: a one-dimensional Platt calibration fit on 100 labeled examples maps the raw margin to estimated correctness probabilities and reduces held-out squared error by about 57%. Right: in Monte Carlo experiments with $N = 2000$ unlabeled scores, calibrated PPI yields substantially shorter 90% confidence intervals for $ψ$ than both labeled-only estimation and raw PPI; the mean interval length is about 17% smaller at $n = 25$ labeled examples and about 15% smaller at $n = 50$. Source: Lars van der Laan, Mark van der Laan

The motivating example in the paper — shown in Figure 1 — is perhaps the most immediately vivid. The task: estimate the average human-rated correctness of an AI language model on the MATH benchmark. The labeled sample is a small set of human annotations. The unlabeled sample is the raw output of an AI evaluator (ArmoRM) applied to thousands of problems. The evaluator's raw "margin" score contains genuine signal — it really does track whether the model gets things right — but it's on a completely different numerical scale than the binary correctness labels.

After fitting a one-dimensional Platt scaling calibration (a logistic curve in the raw margin) on just 100 labeled examples, the held-out squared error drops by about 57% (van der Laan & van der Laan, 2026). The calibrated score now lives in the right ballpark numerically. Feed it into the AIPW estimator, and the confidence intervals for the average correctness shrink substantially compared to using the raw score. In Monte Carlo simulations with $N = 2000$ unlabeled scores, the mean 90% confidence interval is about 17% shorter at $n = 25$ labeled examples, and about 15% shorter at $n = 50$ .
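For readers who want to see the mechanics, here is a minimal Platt-scaling sketch: it fits $P (Y = 1 ∣ f) = \mathrm{sigmoid} (a f + b)$ on 100 labeled examples and compares held-out squared error before and after. Everything here is an assumption made for illustration: the data are synthetic, not the MATH/ArmoRM data; `fit_platt` is a hypothetical helper; and plain gradient descent on a standardized score is chosen for simplicity. The sketch does not attempt to reproduce the paper's 57% figure.

```python
import math
import random
from statistics import fmean, pstdev


def _sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_platt(scores, labels, steps=2000, lr=1.0):
    """Fit sigmoid(a * z + b) by gradient descent on the mean log loss,
    where z is the score standardized on the training data."""
    n = len(scores)
    center = fmean(scores)
    scale = pstdev(scores) or 1.0
    z = [(f - center) / scale for f in scores]
    a = b = 0.0
    for _ in range(steps):
        resid = [_sigmoid(a * zi + b) - y for zi, y in zip(z, labels)]
        a -= lr * sum(r * zi for r, zi in zip(resid, z)) / n
        b -= lr * sum(resid) / n
    return lambda f: _sigmoid(a * (f - center) / scale + b)


# Synthetic stand-in for an evaluator margin: informative, wrong scale.
rng = random.Random(0)


def draw(m):
    scores, labels = [], []
    for _ in range(m):
        x = rng.gauss(0.0, 1.0)
        labels.append(1 if rng.random() < _sigmoid(2.0 * x) else 0)
        scores.append(8.0 * x + 3.0)
    return scores, labels


train_s, train_y = draw(100)   # small labeled set used for calibration
test_s, test_y = draw(2000)    # held-out set used only for evaluation
calibrate = fit_platt(train_s, train_y)

mse_raw = fmean([(y - s) ** 2 for s, y in zip(test_s, test_y)])
mse_cal = fmean([(y - calibrate(s)) ** 2 for s, y in zip(test_s, test_y)])
print(f"held-out squared error: raw={mse_raw:.3f}, calibrated={mse_cal:.3f}")
```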

Confidence Interval Reduction from Calibration (LLM Benchmark)

Mean 90% confidence interval length for estimating average LLM correctness on the MATH benchmark, comparing labeled-only estimation, raw PPI, and calibrated PPI. N=2000 unlabeled scores. Values show relative interval length (raw PPI = 1.0).

Relative mean 90% CI length (raw PPI = 1.00)
n	Labeled-only	Raw PPI	Calibrated PPI
25	1.28	1.00	0.83
50	1.18	1.00	0.85
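The relative lengths above translate into a few headline numbers with simple arithmetic. The snippet below is only a reading aid: the "label multiplier" uses the rough rule that interval length scales like one over the square root of the number of labels, which ignores the unlabeled-sample term and is an assumption of this sketch, not a result from the paper.

```python
# Relative mean 90% CI lengths from the chart (raw PPI = 1.00).
relative_length = {
    25: {"labeled_only": 1.28, "raw_ppi": 1.00, "calibrated_ppi": 0.83},
    50: {"labeled_only": 1.18, "raw_ppi": 1.00, "calibrated_ppi": 0.85},
}

summary = {}
for n, r in relative_length.items():
    cal = r["calibrated_ppi"]
    summary[n] = {
        # Fractional shortening of the interval against each baseline.
        "shorter_than_raw": 1 - cal / r["raw_ppi"],
        "shorter_than_labeled_only": 1 - cal / r["labeled_only"],
        # Rough rule: length ~ 1 / sqrt(labels), so a length ratio q is
        # worth a factor of 1 / q**2 in labeled examples.
        "label_multiplier_vs_raw": (r["raw_ppi"] / cal) ** 2,
    }

for n, s in summary.items():
    print(n, {k: round(v, 3) for k, v in s.items()})
```

Read this way, calibration at $n = 25$ is worth roughly as much as collecting 45% more human labels for raw PPI, and the calibrated interval is about a third shorter than the labeled-only one.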

The synthetic simulation study in Figure 2 systematically tests these ideas across different ratios of labeled to unlabeled data. When the unlabeled sample is much larger than the labeled one ($N/n = 16$ — the regime most relevant for real applications), isotonic-calibrated PPI consistently achieves lower standard deviation and higher relative efficiency than the uncalibrated PPI baseline. Coverage — the probability that a 95% confidence interval actually contains the true value — stays close to the nominal 95% level across all estimators that correctly account for uncertainty.
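A stripped-down Monte Carlo in the same spirit can be run in standard-library Python. It is not the paper's simulation design: the data-generating process, the raw score $\exp (3 X)$ , the sample sizes, and the choice to fit isotonic calibration on the same labels used by the estimator are all assumptions of this sketch. It reports the standard deviation of each estimator and the empirical coverage of nominal 95% intervals at $N / n = 16$.

```python
import math
import random
from bisect import bisect_right
from statistics import fmean, pstdev, variance


def isotonic_fit(scores, outcomes):
    """Pool-adjacent-violators fit (scores are continuous, so no ties)."""
    blocks = []  # [sum of outcomes, count, smallest score in block]
    for f, y in sorted(zip(scores, outcomes)):
        blocks.append([y, 1, f])
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, _ = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    starts = [b[2] for b in blocks]
    levels = [b[0] / b[1] for b in blocks]
    return lambda f: levels[max(bisect_right(starts, f) - 1, 0)]


def aipw(y_lab, f_lab, f_unlab):
    """AIPW estimate and plug-in standard error (score treated as fixed)."""
    n, N = len(y_lab), len(f_unlab)
    rho = n / (n + N)
    lab = [y - (1 - rho) * f for y, f in zip(y_lab, f_lab)]
    est = fmean(lab) + (1 - rho) * fmean(f_unlab)
    se = math.sqrt(variance(lab) / n + (1 - rho) ** 2 * variance(f_unlab) / N)
    return est, se


def simulate(reps=200, n=100, ratio=16, seed=0):
    rng = random.Random(seed)
    truth = 0.5  # E[Y] when Y = X + noise and X ~ Uniform(0, 1)
    runs = {"raw": [], "isotonic": []}
    for _ in range(reps):
        x_lab = [rng.random() for _ in range(n)]
        y_lab = [x + rng.gauss(0.0, 0.3) for x in x_lab]
        x_unlab = [rng.random() for _ in range(ratio * n)]
        # Raw score: ranks units correctly but lives on the wrong scale.
        f_lab = [math.exp(3.0 * x) for x in x_lab]
        f_unlab = [math.exp(3.0 * x) for x in x_unlab]
        cal = isotonic_fit(f_lab, y_lab)  # in-sample calibration
        runs["raw"].append(aipw(y_lab, f_lab, f_unlab))
        runs["isotonic"].append(
            aipw(y_lab, [cal(f) for f in f_lab], [cal(f) for f in f_unlab]))
    out = {}
    for name, results in runs.items():
        hits = sum(abs(e - truth) <= 1.96 * s for e, s in results)
        out[name] = {"sd": pstdev([e for e, _ in results]),
                     "coverage": hits / len(results)}
    return out


summary = simulate()
for name, stats in summary.items():
    print(name, {k: round(v, 3) for k, v in stats.items()})
```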

Calibration Reduces Prediction Error by 57%

Held-out squared prediction error for the ArmoRM evaluator on binary human correctness labels (MATH benchmark), before and after Platt scaling calibration on 100 labeled examples. Lower is better.

Score	Relative held-out squared error (raw = 1.00)
Raw score (uncalibrated)	1.00
After Platt calibration	0.43

The real-data experiments (Figure 3 and Figure 4) span datasets from the original PPI papers — genomics, economics, and ecological measurements — as well as a new LLM-evaluation benchmark. Across these varied contexts, the calibrated estimators generally outperform raw PPI and are competitive with or better than AIPW and PPI++. No single method wins on every dataset, which is expected: calibration helps most when the original score is miscalibrated, and some scores are already well-aligned.

Figure 3: Main-text benchmark summary for the reproduced ppi_py datasets. Panels report normalized MSE relative to PPI, relative efficiency versus PPI, and coverage for the main comparator set. Source: Lars van der Laan, Mark van der Laan

The theoretical heart of the paper is the isotonic calibeating result — the term "calibeating" is borrowed from the forecasting literature (Foster & Hart, 2023) and means beating a benchmark via calibration. The paper proves two things. First, the isotonic-calibrated score cannot be beaten by any further monotone post-processing: it is first-order optimal within the class of monotone transformations. Second, the resulting mean estimator is first-order equivalent to an oracle estimator that uses the true conditional mean $E [Y ∣ f (X)]$ — the best possible regression adjustment you could construct from $f$ alone. These are strong guarantees for what is, in implementation, a very simple procedure.

Why This Changes Things

The implications ripple outward in several directions.

For AI-assisted science. The practice of supplementing expensive measurements with cheap model-generated predictions is everywhere: genomics studies that use phenotype predictors trained on electronic health records, clinical trials that use prognostic models trained on historical data, ecology surveys that use satellite imagery models. In all of these settings, the prediction model was trained somewhere else, for some other purpose, with some other loss function. Miscalibration is not a bug — it's the default. This paper provides a simple, theoretically grounded fix that any analyst can apply without touching the underlying model.

For LLM evaluation specifically. The paper's application to large language model benchmarks is timely and practically important. As AI systems proliferate, the question of how well they perform on a given task becomes economically consequential. Human evaluation is expensive; AI evaluators are cheap but miscalibrated. The calibrated PPI framework offers a statistically rigorous way to combine both — getting the precision of human judgment extended by the scale of automated evaluation, with formal uncertainty quantification.

Figure 4: PPE-centered LLM-evaluation benchmark. Panels report normalized MSE relative to PPI, relative efficiency versus labeled-only, and coverage after macro-averaging across the public PPE evaluator models. We omit RewardBench from the main-text figure until its separate robustness run completes cleanly. Source: Lars van der Laan, Mark van der Laan

The PPE benchmark results in Figure 4 show that the gains are not just a lab curiosity.

For the statistical literature. The paper performs a useful act of intellectual housekeeping. The PPI literature, which emerged largely from the machine-learning community, rediscovered several ideas from the 1990s semiparametric literature under new names and with new software. The authors show explicitly that PPI is a special case of AIPW, that PPI++ is AIPW with empirical efficiency maximization (Rubin & van der Laan, 2008), and that cross-PPI is AIPW with cross-fitting. This doesn't diminish PPI's contribution — accessible software and framing matter enormously for adoption — but it clarifies the theoretical foundations and opens the door to importing decades of semiparametric theory into the PPI setting.

The optimality result has teeth. The first-order equivalence between isotonic-calibrated PPI and the oracle estimator means that, within the class of estimators that use monotone transformations of the original score, you cannot do better. If you have a prediction model and you want to estimate a population mean as efficiently as possible without retraining, isotonic calibration followed by AIPW is asymptotically unbeatable. That's a remarkably strong statement for something you can implement in a few lines of code.

The paper also clarifies a subtle failure mode of the original PPI estimator (Angelopoulos et al., 2023) that is easy to overlook. Standard PPI averages $f (X)$ only over the unlabeled covariates before applying the residual correction, discarding the labeled covariates' contribution to the plug-in term. AIPW uses both labeled and unlabeled covariates. When the unlabeled sample is large relative to the labeled sample, the difference is small. But in balanced settings ($N \approx n$), this omission can make PPI substantially less efficient than AIPW, even with a well-calibrated score. The calibrated PPI framework uses the AIPW form and thus avoids this waste.
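The size of that waste can be worked out from second moments alone. The sketch below assumes a fixed score and independent samples; the two function names are hypothetical, and the PPI form is the one described in the text (plug-in term averaged over unlabeled covariates only). The example uses a perfectly calibrated score, $f = μ_{0}$ , with $\mathrm{Var} \{μ_{0} (X)\} = 1$ and noise variance 1.

```python
def ppi_variance(var_y, var_f, cov_yf, n, N):
    """Variance of mean_U[f] + mean_L[Y - f] for a fixed score f."""
    return var_f / N + (var_y - 2 * cov_yf + var_f) / n


def aipw_variance(var_y, var_f, cov_yf, n, N):
    """Variance of mean_L[Y - c f] + c * mean_U[f], with c = N / (n + N)."""
    c = N / (n + N)  # equals 1 - rho_n
    labeled_term = (var_y - 2 * c * cov_yf + c * c * var_f) / n
    unlabeled_term = c * c * var_f / N
    return labeled_term + unlabeled_term


# Perfect score f = mu_0: Var(Y) = 1 + 1, Var(f) = 1, Cov(Y, f) = 1.
moments = dict(var_y=2.0, var_f=1.0, cov_yf=1.0)

balanced_ppi = ppi_variance(n=100, N=100, **moments)
balanced_aipw = aipw_variance(n=100, N=100, **moments)
large_ppi = ppi_variance(n=100, N=10_000, **moments)
large_aipw = aipw_variance(n=100, N=10_000, **moments)

print(f"N = n:     AIPW/PPI variance ratio = {balanced_aipw / balanced_ppi:.3f}")
print(f"N = 100 n: AIPW/PPI variance ratio = {large_aipw / large_ppi:.4f}")
```

In the balanced case the AIPW form has three quarters of PPI's variance even though the score is perfect; with a hundred times more unlabeled data, the two are nearly indistinguishable.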

What's Next

The framework as presented focuses on mean estimation — estimating $ψ_{0} = E [Y]$ . Many scientific questions demand more: quantiles, treatment effect heterogeneity, regression coefficients, survival curves. Extensions to general estimating equations have been explored in related work (Ji et al., 2025), and the authors note that their calibrated DML framework (van der Laan et al., 2024c) provides a broader scaffold. Translating the specific calibeating guarantees to those richer settings is a natural next step.

The small-sample regime deserves further attention. Isotonic regression is a nonparametric method; it needs enough labeled data to fit a useful monotone transformation. When $n$ is very small — say, fewer than 20 observations — the fitted isotonic score may not improve much over the raw score, or may overfit. The paper's experiments show that even with $n = 25$ , calibration helps, but the theoretical guarantees are asymptotic. Finite-sample bounds, perhaps via conformal prediction or bootstrap corrections, would make the framework safer in truly data-scarce regimes.

There is also the question of which calibration method to choose. The paper studies linear calibration (equivalent to PPI++), isotonic calibration (nonparametric, provably optimal among monotone transformations), Platt scaling (logistic link, natural for binary outcomes), and histogram binning. The AutoCal option in the accompanying package selects adaptively. How to choose well in practice — especially when $n$ is small and the prediction score's properties are unknown — is an open empirical and theoretical question.

One deeper question lurks beneath the surface. The paper's guarantees apply when the labeled and unlabeled samples are drawn from the same distribution. Distribution shift — when the model was trained on different data than the population you're studying — is precisely the setting where miscalibration is worst and calibration is most needed. But it also complicates the theory: if the labeled sample itself is drawn from a shifted distribution, calibration on it may not align the score with the target population. The authors acknowledge this but leave it for future work. Given that distribution shift is ubiquitous in real applications, this seems like fertile ground.

The Python package ppi_aipw is already available at larsvanderlaan.github.io/ppi-aipw, with documentation and worked examples. For any researcher who is already using prediction-powered inference — or who is planning to deploy a pre-trained model to extend a small labeled study — the barrier to adopting this framework is essentially zero. The calibration step takes seconds. The efficiency gains can be substantial. The formal guarantees mean you're not gambling on a heuristic.

In an era when data labeling is expensive and AI predictions are abundant, the question of how to combine them responsibly — not just cheaply — is one of the most practically important problems in applied statistics. This paper advances a principled answer: calibrate first, then infer. The math says it works. The experiments confirm it. The code is ready.
