The AI That Learns Where the Pancreas Is Before It Looks for Cancer

Pancreatic cancer kills roughly 88% of patients within five years of diagnosis. The main reason isn't that treatment is hopeless — it's that the tumor is almost always found too late. Catching it early enough to matter requires spotting a lesion that can be just a few millimeters across, nestled against soft tissue that looks nearly identical on a CT scan, in an organ tucked behind the stomach and draped over the spine. Even experienced radiologists miss early-stage tumors. And AI models trained to help? They have a reliability problem that the field has been slow to confront.
The specific failure mode is called cohort shift: a model trained on scans from one hospital network degrades sharply when it encounters scans from another, with different scanner brands, contrast injection timing, slice thicknesses, and patient demographics. The model essentially learns to recognize tumors plus invisible fingerprints of the training institution, and when those fingerprints change, performance collapses. For nnU-Net, the current gold-standard architecture in medical image segmentation, "collapse" is not a metaphor: tested on out-of-cohort pancreas scans, it achieved a tumor Dice score (a measure of overlap with the ground truth, from 0 to 1) of just 0.066 and detected tumors in only 6.8% of positive patient cases (Ma & Ma, 2026). At that level the model misses more than nine in ten tumor-bearing patients.
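The Dice score quoted above is twice the overlap between prediction and ground truth, divided by the combined size of the two masks. A minimal sketch follows, assuming binary masks flattened into lists of 0/1; the function name and the empty-mask convention are illustrative choices, not taken from the paper.

```python
def dice_score(pred, truth):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(pred) + sum(truth)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total

truth = [0, 1, 1, 1, 1, 0, 0, 0]   # 4 tumor voxels
pred = [0, 0, 0, 1, 1, 1, 0, 0]    # 3 predicted voxels, 2 of them correct
print(round(dice_score(pred, truth), 3))  # 2*2 / (3+4) = 0.571
```

A score of 0.066 therefore means the predicted and true tumor regions barely touch.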
PanGuide3D, introduced in this paper, is a direct answer to that failure. It achieves a markedly higher tumor Dice and detects tumors in 84.2% of out-of-cohort positive cases. The approach is not a bigger model or more data; it is a smarter structural idea: teach the network to first ask "where is the pancreas?", and then use the probabilistic answer to that question to guide its search for cancer.
The Science
The core intuition behind PanGuide3D is borrowed from how a radiologist actually reads a scan. You don't search the entire abdomen for a suspicious mass. You find the pancreas, then examine it. Prior AI systems have tried to formalize this with "cascaded" pipelines — model one detects the organ, model two crops to that region and searches for tumor. But hard cascades are brittle: if step one is wrong, step two has no way to recover, and running two separate models adds engineering complexity and inference time.
PanGuide3D replaces the hard cascade with a probabilistic soft gate. A single shared encoder — built on the nnU-Net framework (Isensee et al., 2021), a 3D convolutional architecture that processes CT volumes in overlapping patches — feeds two parallel decoders. One decoder produces a probabilistic pancreas map: not a binary mask, but a continuous field of values between 0 and 1 representing the likelihood that each voxel (three-dimensional pixel) belongs to the pancreas. That map is then injected into the second decoder — the tumor head — at multiple spatial scales simultaneously, gating tumor predictions toward anatomically plausible locations. Because the gating is differentiable (meaning gradients flow through it during training), the whole system learns end-to-end; the pancreas predictor is rewarded not only for finding the pancreas but for finding it in a way that helps locate tumors.
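The soft gate described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the paper's implementation: the multiplicative form of the gate, the `floor` parameter, average pooling for downsampling, and all shapes are hypothetical. In a deep-learning framework the same operations would be differentiable, which is what allows end-to-end training.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def avg_pool3d(vol, factor):
    """Downsample a (D, H, W) volume by averaging non-overlapping blocks."""
    d, h, w = (s // factor for s in vol.shape)
    v = vol[:d * factor, :h * factor, :w * factor]
    return v.reshape(d, factor, h, factor, w, factor).mean(axis=(1, 3, 5))

def soft_gate(features, pancreas_prob, floor=0.1):
    """Scale (C, D, H, W) tumor features by a (D, H, W) pancreas probability map.

    `floor` (an assumption) keeps a small signal outside the organ so that
    errors in the pancreas map remain recoverable.
    """
    gate = floor + (1.0 - floor) * pancreas_prob
    return features * gate[None, ...]

rng = np.random.default_rng(0)
pancreas_prob = sigmoid(rng.normal(size=(8, 8, 8)))  # continuous values in (0, 1)

# Hypothetical tumor-decoder feature maps at full, 1/2 and 1/4 resolution.
decoder_features = {
    1: rng.normal(size=(4, 8, 8, 8)),
    2: rng.normal(size=(4, 4, 4, 4)),
    4: rng.normal(size=(4, 2, 2, 2)),
}

gated = {}
for factor, feats in decoder_features.items():
    prob = pancreas_prob if factor == 1 else avg_pool3d(pancreas_prob, factor)
    gated[factor] = soft_gate(feats, prob)  # same map, injected at every scale
    print(factor, gated[factor].shape)
```

Because the gate never reaches zero, a tumor feature in a region the pancreas decoder doubts is attenuated rather than erased.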
A second structural addition sits at the network's deepest layer — the bottleneck, where feature maps are most compressed and abstract. At this level, PanGuide3D inserts a lightweight Transformer module: an attention mechanism that lets every spatial location look at every other location simultaneously, rather than only at nearby voxels as convolutions do. The purpose is global coherence. Convolutional networks excel at local texture; they can mistake a small brightening in the bowel for a pancreatic tumor if local cues are similar. The Transformer bottleneck adds the question: does this suspicious spot make sense given everything else in the scan? Under cohort shift, when local intensities drift, global anatomical logic becomes a crucial stabilizer.
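To make "every spatial location looks at every other location" concrete, here is a minimal single-head self-attention sketch over a flattened bottleneck volume. The channel count, grid size, random weights, and residual connection are assumptions for illustration; the paper's module may differ in heads, normalization, and positional encoding.

```python
import numpy as np

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over (N, C) tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
C, D, H, W = 16, 4, 4, 4                      # hypothetical bottleneck shape
bottleneck = rng.normal(size=(C, D, H, W))

tokens = bottleneck.reshape(C, -1).T          # one token per voxel: (64, 16)
w_q, w_k, w_v = (rng.normal(scale=C ** -0.5, size=(C, C)) for _ in range(3))

attended, weights = self_attention(tokens, w_q, w_k, w_v)
out = (tokens + attended).T.reshape(C, D, H, W)  # residual, back to a volume
print(weights.shape)  # (64, 64): each voxel attends to all 64 voxels
```

Attention cost grows with the square of the token count, which is one reason to place such a module at the bottleneck, where the grid is smallest.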
The evaluation framework was deliberately stringent. The model was trained on PanTS, a large pancreatic tumor segmentation cohort of 7,200 training scans — but with a twist that makes it unusually demanding: only 9.5% of those scans contain any tumor at all. The model must learn to confidently say "no tumor" on 90% of cases while still detecting subtle lesions on the other 10%. Testing was then performed both on held-out PanTS data (in-cohort) and on MSD Task07, the Medical Segmentation Decathlon pancreas benchmark, a fully independent cohort where every one of 57 cases contains a tumor. The two cohorts were preprocessed identically — resampled to 1.5 mm isotropic spacing — to ensure that performance differences reflect genuine architectural differences rather than pipeline artifacts.
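Two pieces of arithmetic in this setup are easy to check by hand: how a scan's grid changes when resampled to 1.5 mm isotropic spacing, and how few tumor scans the quoted percentages imply. The sketch below uses a hypothetical input scan (its shape and spacing are made up), and the tumor count is approximate because 9.5% is itself rounded.

```python
def resampled_shape(shape, spacing_mm, target_mm=1.5):
    """Grid size after resampling a volume to isotropic `target_mm` voxels."""
    return tuple(max(1, round(n * s / target_mm)) for n, s in zip(shape, spacing_mm))

# Hypothetical input: 120 slices at 3.0 mm, 512 x 512 in-plane at 0.8 mm.
new_shape = resampled_shape((120, 512, 512), (3.0, 0.8, 0.8))
print(new_shape)  # (240, 273, 273)

# Class balance implied by the quoted figures.
n_scans = 7200
n_tumor = round(n_scans * 0.095)
print(n_tumor, n_scans - n_tumor)  # 684 6516
```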
What They Found
Tumor Detection: Out-of-Cohort (MSD) Patient Sensitivity
Fraction of tumor-positive patients correctly detected by each model on the out-of-cohort MSD dataset. Higher is better. PanGuide3D dramatically outperforms all baselines.
| Model | Patient sensitivity (MSD) |
|---|---|
| nnU-Net | 0.068 |
| RADFM | 0.333 |
| SwinUNETR | 0.602 |
| 2Decoder (ablation) | 0.737 |
| TransBNeck (ablation) | 0.772 |
| PanGuide3D | 0.842 |
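Patient-level sensitivity, as tabulated above, is the fraction of tumor-positive patients in whom the model finds the tumor. A minimal sketch follows, assuming a patient counts as detected when the prediction overlaps the ground-truth tumor by at least one voxel; that criterion and the toy data are assumptions, since the article does not state the paper's exact rule.

```python
def patient_sensitivity(cases, min_overlap_voxels=1):
    """cases: list of (has_tumor, overlap_voxels) pairs, one per patient."""
    positives = [overlap for has_tumor, overlap in cases if has_tumor]
    if not positives:
        raise ValueError("no tumor-positive patients to evaluate")
    detected = sum(1 for overlap in positives if overlap >= min_overlap_voxels)
    return detected / len(positives)

# Toy cohort: four positive patients (two detected) and one negative patient.
cases = [(True, 0), (True, 120), (True, 35), (False, 0), (True, 0)]
print(patient_sensitivity(cases))  # 0.5
```

Tumor-free patients do not enter this metric at all, which is why it is reported alongside false-positive volume.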
The headline result is the cross-cohort detection numbers, but the table of full results reveals the texture of what's actually happening. Standard nnU-Net, the field's workhorse, is catastrophically non-robust: its false-positive tumor volume on out-of-cohort data is 19.646 cm³ per scan, meaning roughly half a golf ball's worth of tissue is incorrectly flagged as cancer. PanGuide3D reduces that to 2.004 cm³, a roughly tenfold improvement. Each unnecessary cubic centimeter of false positive potentially means an anxious patient awaiting a biopsy that reveals nothing.
The ablation experiments, two stripped-down variants of PanGuide3D tested separately, are illuminating. The "2Decoder" model keeps the dual-decoder design with pancreas conditioning but removes the Transformer bottleneck. On its own it already lifts tumor Dice on MSD far above nnU-Net's 0.066, demonstrating that the anatomical conditioning alone accounts for the majority of the gain. Keep the Transformer bottleneck but drop the pancreas conditioning (the "TransBNeck" variant), and Dice reaches 0.421. Combine both, and PanGuide3D does better still. The ingredients are complementary, not redundant.
False-Positive Tumor Volume: Out-of-Cohort (MSD)
Mean volume of tissue incorrectly flagged as tumor per scan (cm³). Lower is better. PanGuide3D produces ~10× less spurious tumor signal than standard nnU-Net.
| Model | False-positive volume (cm³) |
|---|---|
| nnU-Net | 19.646 |
| RADFM | 8.666 |
| SwinUNETR | 8.55 |
| 2Decoder (ablation) | 4.781 |
| TransBNeck (ablation) | 4.291 |
| PanGuide3D | 2.004 |
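False-positive volume is a voxel count converted to physical units. The sketch below assumes flat binary masks and the 1.5 mm isotropic spacing mentioned earlier, so each voxel is 3.375 mm³; the function names are illustrative.

```python
def false_positive_volume_cm3(pred, truth, spacing_mm=(1.5, 1.5, 1.5)):
    """Volume of voxels predicted as tumor that are not tumor in the ground truth."""
    fp_voxels = sum(1 for p, t in zip(pred, truth) if p and not t)
    voxel_mm3 = spacing_mm[0] * spacing_mm[1] * spacing_mm[2]
    return fp_voxels * voxel_mm3 / 1000.0   # 1 cm^3 = 1000 mm^3

def cm3_to_voxels(volume_cm3, spacing_mm=(1.5, 1.5, 1.5)):
    """Approximate number of voxels that make up a given volume."""
    voxel_mm3 = spacing_mm[0] * spacing_mm[1] * spacing_mm[2]
    return round(volume_cm3 * 1000.0 / voxel_mm3)

print(false_positive_volume_cm3([1, 1, 0, 1], [1, 0, 0, 0]))  # 2 voxels
print(cm3_to_voxels(19.646))  # 5821 voxels for the nnU-Net figure
print(cm3_to_voxels(2.004))   # 594 voxels for the PanGuide3D figure
```

Put in voxel terms, the table's extremes correspond to roughly 5,800 versus 600 wrongly flagged voxels per scan at this spacing.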
The size-stratified analysis reveals where the clinical stakes are highest. Small tumors, the ones that matter most for early detection, are where conventional models fail most dramatically. nnU-Net produces near-zero Dice on many small-lesion cases, particularly after cohort shift. PanGuide3D maintains a consistent band of higher Dice across the small-volume regime. The anatomical conditioning is doing what you'd hope: even when a tiny lesion looks like noise, the pancreas prior pulls the model's attention to the right neighborhood and keeps it there.
Performance was also stratified by anatomical location — head, body, and tail of the pancreas. The body and tail are harder targets because tumors there are often farther from the bile duct, tend to present later, and have fewer distinctive imaging features. PanGuide3D showed advantages across all three subregions, with particular clarity in the body and tail cases that trip up other models.
Finally, the calibration analysis — comparing a model's stated confidence to its actual correctness — showed PanGuide3D tracking closer to perfect calibration than nnU-Net across most confidence levels. Calibration matters enormously in clinical deployment: a model that says "90% confident this is cancer" when it's actually right 40% of the time is not just inaccurate, it's actively misleading. PanGuide3D is not perfectly calibrated (it tends toward overconfidence at the highest end), but it is substantially more reliable than the baseline.
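One common way to summarize the gap between stated confidence and actual correctness is expected calibration error (ECE): bin predictions by confidence, then average the difference between mean confidence and accuracy in each bin. The article does not say which calibration measure the paper uses, so the sketch below is an assumption; it reproduces the "90% confident, right 40% of the time" example.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean |confidence - accuracy| over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece

# Ten predictions at 90% confidence, only four of them correct.
confidences = [0.9] * 10
correct = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
print(round(expected_calibration_error(confidences, correct), 3))  # 0.5
```

A perfectly calibrated model would score 0 on this measure.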
Why This Changes Things
Tumor Dice Score: In-Cohort vs. Out-of-Cohort
Dice score (0–1 overlap quality) for each model, comparing in-cohort (PanTS) vs. out-of-cohort (MSD) performance. The gap reveals cohort shift vulnerability.
| Model | Tumor Dice |
|---|---|
| nnU-Net | 0.202 |
| RADFM | 0.255 |
| SwinUNETR | 0.336 |
| 2Decoder | 0.367 |
| TransBNeck | 0.412 |
| PanGuide3D | 0.46 |
The pancreatic cancer AI problem has two distinct failure modes, and most prior work has only addressed one. The first failure mode is in-cohort performance — can the model find tumors in the data it was trained on? Enormous effort has gone into this, producing models with impressive benchmark numbers. The second failure mode is cross-cohort reliability — does the model still work when it encounters scans from a different institution? This is the failure mode that determines whether a model is actually deployable in healthcare, and it has received far less attention (Ma & Ma, 2026).
The gap PanGuide3D exposes is jarring. Even RADFM, a large-scale medical foundation model trained on web-scale radiology data, achieves only 33.3% patient sensitivity on out-of-cohort scans. SwinUNETR, a modern Transformer-based architecture specifically designed for volumetric medical images, reaches 60.2%. These are not weak baselines; they are the state of the art. PanGuide3D at 84.2% is not a marginal improvement. It represents a qualitatively different level of reliability.
The mechanism matters as much as the number. Previous organ-guided approaches used hard masks: find the pancreas, crop to it, run a second model. The problem is that "find the pancreas" can itself go wrong under cohort shift — contrast timing differences can blur organ boundaries, unusual anatomy can confound localization — and a hard crop means errors propagate catastrophically. Soft probabilistic conditioning sidesteps this. The tumor decoder never gets a binary instruction ("the pancreas is here, ignore everything else"); it gets a graded map ("the pancreas is probably here, maybe here, unlikely there") that preserves uncertainty and lets the tumor head make joint inferences. This is much closer to how diagnostic reasoning actually works.
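The difference between a hard crop and soft conditioning shows up in a toy one-dimensional example. Here the pancreas map is slightly off, assigning only 0.45 to the voxel with the strongest tumor evidence. The numbers, the 0.5 threshold, and the purely multiplicative gate are all illustrative assumptions.

```python
def hard_gate(tumor_scores, pancreas_prob, threshold=0.5):
    """Hard cascade: discard everything outside the thresholded organ mask."""
    return [t if p >= threshold else 0.0 for t, p in zip(tumor_scores, pancreas_prob)]

def soft_gate(tumor_scores, pancreas_prob):
    """Soft conditioning: weight tumor evidence by the organ probability."""
    return [t * p for t, p in zip(tumor_scores, pancreas_prob)]

tumor_scores = [0.1, 0.2, 0.9, 0.8, 0.1]        # strongest evidence at index 2
pancreas_prob = [0.05, 0.30, 0.45, 0.90, 0.20]  # localization misses index 2

hard = hard_gate(tumor_scores, pancreas_prob)
soft = soft_gate(tumor_scores, pancreas_prob)
print("hard:", hard[2])             # evidence erased by the crop
print("soft:", round(soft[2], 3))   # evidence attenuated but still present
```

Under the hard gate the localization error is unrecoverable; under the soft gate downstream layers can still act on the surviving signal.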
The reduction in false positives has a dimension that pure accuracy metrics miss. In clinical workflow, a false positive tumor prediction is not a neutral event. It triggers follow-up imaging, potentially biopsy, definitely patient anxiety. A model that is technically "competitive" in overlap metrics while generating many cubic centimeters of spurious tumor signal per scan would, in practice, be unusable. PanGuide3D's false-positive volume of 2.004 cm³ is not perfect, but it is in a range where radiologist review could be efficient rather than exhausting.
The architecture also has a practical virtue that matters for adoption: it is not large. It does not require massive pretraining on proprietary datasets. It is built on nnU-Net — freely available, widely used, understood by medical imaging teams worldwide — with targeted additions (a second decoder, a lightweight Transformer module) that add minimal computational cost. The barrier to adoption is much lower than for foundation models, which require infrastructure and data that most hospitals don't have.
What's Next
The paper acknowledges what it cannot yet claim. The MSD out-of-cohort evaluation, while rigorous, uses only 57 cases — enough to establish meaningful trends but not enough to characterize performance across the full diversity of real-world scanner environments. A prospective multi-institutional study, following the model through an actual clinical workflow, would be the necessary next step before deployment decisions.
The calibration gap also deserves attention. PanGuide3D is better calibrated than its competitors, but the residual overconfidence at high predicted-probability thresholds means that clinical users should not treat the model's confidence scores as direct probabilities. Future work might combine the architectural advances here with explicit calibration techniques — temperature scaling, conformal prediction, or Bayesian extensions of the uncertainty estimation — to produce probability outputs that are genuinely trustworthy.
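Of the techniques named above, temperature scaling is the simplest: divide the model's logits by a single scalar T, chosen on held-out data to minimize negative log-likelihood. The sketch below is a binary, grid-search version with made-up logits; it is not something the paper reports doing, and a real implementation would typically optimize T with a gradient-based method.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(logits, labels, temperature):
    """Mean binary negative log-likelihood after dividing logits by `temperature`."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z / temperature), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)

def fit_temperature(logits, labels, candidates=None):
    """Pick the temperature with the lowest held-out NLL from a grid."""
    if candidates is None:
        candidates = [0.5 + 0.05 * i for i in range(91)]  # 0.5 to 5.0
    return min(candidates, key=lambda t: nll(logits, labels, t))

# Hypothetical overconfident model: large-magnitude logits, some of them wrong.
val_logits = [4.0, 4.0, 4.0, 4.0, -4.0, -4.0, 3.0, 3.0]
val_labels = [1, 1, 0, 1, 0, 0, 1, 0]

t_star = fit_temperature(val_logits, val_labels)
print(t_star > 1.0)  # True: the fit softens the overconfident outputs
print(nll(val_logits, val_labels, t_star) < nll(val_logits, val_labels, 1.0))  # True
```

Temperature scaling leaves the ranking of predictions unchanged, so segmentation masks at a fixed 0.5 threshold are unaffected; only the reported confidence changes.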
There's also an interesting open question about the pancreas map itself. In PanGuide3D, the pancreas decoder is supervised by ground-truth pancreas annotations and then used to guide tumor decoding. But the quality of that guidance depends on how well the pancreas map generalizes under shift. The researchers found that pancreas Dice remains relatively robust across cohorts (unlike tumor Dice), which is why soft conditioning works — but in settings with more severe organ-level shift, the pancreas prior might itself become unreliable. Investigating when and why organ priors transfer more robustly than lesion priors is a meaningful theoretical question.
More broadly, the architecture here suggests a template applicable beyond pancreatic cancer. Many tumor segmentation problems have the same structure: a target lesion embedded within a host organ whose boundaries are generally easier to delineate. Liver lesions within the liver, lung nodules within the lung parenchyma, prostate tumors within the prostate gland — all of these could potentially benefit from the same probabilistic anatomical conditioning approach. If soft organ guidance proves as effective in those settings as it does here, it could shift the field's approach to organ-guided segmentation from a pipeline choice to an architectural principle.
Pancreatic cancer's lethality is largely a detection problem. The five-year survival rate for localized pancreatic cancer — caught before it spreads — is around 44%. For metastatic disease, it is 3%. Everything that nudges detection earlier, that reduces the miss rate on small tumors, that makes AI tools reliable enough to trust across institutions, moves patients from that second number toward the first. PanGuide3D does not solve the detection problem. But it demonstrates, with concrete numbers, that the field's standard approach to cross-cohort reliability has been inadequate — and it offers a structurally simple alternative that works substantially better. That's exactly the kind of finding that changes what the next generation of systems gets built to do.