The AI That Reads Your Gut Like a Clinician: A Smarter Way to Review Capsule Endoscopy

Somewhere inside a patient's small intestine, there is a bleeding vessel no bigger than a freckle. Finding it requires watching eight hours of footage — roughly the runtime of three feature films — captured by a camera the size of a large vitamin pill as it tumbles through 6 meters of gut. That camera takes about two frames per second. By the time it exits, it has produced up to 100,000 images. Clinically, perhaps eight of them matter.

This is capsule endoscopy (CE): one of modern medicine's most elegant diagnostic tools, and one of its most exhausting to interpret. A swallowed capsule with a built-in camera and LED light streams continuous footage of the gastrointestinal tract, reaching places a conventional endoscope cannot — the deep small bowel — without sedation or incisions. It has become essential for diagnosing conditions like obscure gastrointestinal bleeding, Crohn's disease, and small-bowel polyps. But the sheer volume of footage means even experienced gastroenterologists spend over an hour reviewing each study, and the cognitive load is enormous. Missed lesions are a real risk.

Automation seems like an obvious solution. But according to a new study by Liu et al. (2025), the existing automated approaches are badly mismatched to the actual clinical problem — and the paper's proposed fix is conceptually elegant: build an AI that thinks the way a clinician thinks.

The Science

Figure 1:
(a) General videos typically contain task-relevant evidence that is temporally dense and visually salient, as reflected by the larger red box and denser timeline;
(b) CE videos instead contain diagnostic evidence that is temporally sparse and visually subtle, as reflected by the smaller red box and sparser timeline;
(c) Uniform sampling selects clinically irrelevant frames due to the sparse nature of lesions, leading to missed diagnoses;
(d) Keyframe sampling captures unsatisfactory representative frames (e.g., bubbles), resulting in misdiagnosis;
(e) DiCE (Ours) adaptively extracts context-aware evidence clips from the raw video, rather than selecting a fixed frame budget, to generate clinically oriented summaries and support correct diagnosis.
Source: Bowen Liu, Li Yang

The paper introduces two things simultaneously: a new task and a new method to solve it. The task is called diagnosis-driven CE video summarization. Rather than asking whether a single frame contains a lesion (the standard framing in most CE AI research), it asks something harder and more clinically useful: given the full video, which frames constitute the evidence that would support a clinical report, and what does that evidence diagnose?

To build and test their system, the researchers created VideoCAP, the first CE dataset built directly from real clinical reports rather than retrospective image labeling. It contains 240 full-length patient videos from Shanghai Renji Hospital, split into 160 for training, 40 for validation, and 40 for testing. Each video's annotations were derived from the actual diagnostic reports issued during routine care — meaning only findings that genuinely contributed to a patient's diagnosis were included, verified by three senior gastroenterologists. The taxonomy covers 12 lesion types: ulcer, erosion, angioectasia, mucosal erythema, eminence lesion, hematocele, lymphangiectasia, lymphoid follicular hyperplasia, polyp, parasite, intestinal fluid accumulation, and normal small intestinal mucosa.

Figure 3: Overview of DiCE. A Selector first filters the raw video into a high-recall candidate pool. Each retained frame is encoded as a spatio-temporal token combining appearance and temporal position. The Context Weaver then constructs a two-level hierarchy of coarse anatomical contexts and fine lesion contexts. Finally, the Evidence Converger aggregates frame-level predictions within each lesion context into a stable context-level diagnosis and removes inconsistent frames, yielding a concise diagnostic summary. Source: Bowen Liu, Li Yang

Key Facts

8–12 hours (video length per exam): A single capsule endoscopy examination produces an ultra-long video stream requiring over an hour of physician review time.

~100,000 (max frames per video): A standard CE video can contain up to 100,000 frames, of which fewer than 10 may be diagnostically relevant.

240 videos (VideoCAP dataset size): The first CE dataset with diagnosis-driven annotations derived from real clinical reports, covering 12 lesion categories.

12 (lesion categories): Including ulcer, polyp, angioectasia, erosion, and 8 others aligned with standard CE reporting guidelines.

8% (useful frames from existing AI): Deployment studies found that after frame-level AI analysis, only 8% of selected images contained significant lesions.

~2× best baseline (DiCE diagnostic yield): DiCE achieved roughly double the diagnostic yield of the next-best keyframe selection method on the VideoCAP test set.

Their proposed framework, DiCE (Divide-then-Diagnose for Capsule Endoscopy), mirrors the standard CE reading workflow in three stages. First, a lightweight Selector ($\mathcal{S}$) screens every frame in the raw video using a frozen vision backbone $f_{\theta}$ and a small classification head $h_{\psi}$:

$s_{t} = h_{\psi}(f_{\theta}(I_{t})) \in [0, 1]$
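The scoring step, and the candidate filter that thresholds it, can be sketched in a few lines. This is only an illustration: the backbone features are replaced by a toy array, the head is assumed to be a linear layer followed by a sigmoid, and the weights, bias, and threshold value are made up rather than taken from the paper.

```python
import numpy as np

def selector_scores(features, weights, bias):
    """Hypothetical head h_psi: linear layer + sigmoid, so each s_t lies in [0, 1]."""
    logits = features @ weights + bias
    return 1.0 / (1.0 + np.exp(-logits))

def select_candidates(scores, tau_s):
    """Return the indices of frames whose score exceeds the threshold tau_s."""
    return np.flatnonzero(scores > tau_s)

# Toy stand-in for frozen backbone features f_theta(I_t): 5 frames, 3 dimensions.
features = np.array([
    [0.1, 0.2, 0.0],
    [2.0, 1.5, 0.3],
    [0.0, 0.1, 0.1],
    [1.8, 2.2, 0.4],
    [0.2, 0.0, 0.1],
])
weights = np.array([1.0, 1.0, 0.0])  # illustrative values, not learned
bias = -2.0

scores = selector_scores(features, weights, bias)
candidates = select_candidates(scores, tau_s=0.5)  # tau_s = 0.5 is an assumption
print(candidates)  # frames 1 and 3 survive the filter
```

Because the backbone is frozen, only the small head would need training in a setup like this, which is what keeps the first pass over tens of thousands of frames cheap.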

Frames scoring above a threshold $\tau_{s}$ are kept as candidates; the rest are discarded. This dramatically shrinks the problem before any expensive reasoning begins. Second, a Context Weaver ($\mathcal{W}$) organizes those candidates into a two-level hierarchy. Each frame is first encoded as a spatio-temporal token $v_{t} = (u_{t} \oplus e_{t})$, a fusion of its visual features $u_{t}$ and a sinusoidal encoding $e_{t}$ of its position in time, so the system knows not just what a frame looks like but where in the gut it was captured. The Weaver then groups frames into broad anatomical contexts (coarse: think "proximal jejunum" versus "terminal ileum") and then into fine-grained lesion contexts, each ideally dominated by a single underlying finding. Third, an Evidence Converger ($\mathcal{E}$) aggregates predictions across all frames in a lesion context $H_{i, j}$ into a single stable diagnosis:

$P_{i, j} = \sum_{I_{t} \in H_{i, j}} p_{t}, \quad \hat{y}_{i, j}^{(0)} = \arg\max_{k} P_{i, j}[k]$

Where individual frames might flip wildly in their predictions due to motion blur or debris, the sum of their probability distributions tells a more consistent story. Frames that contradict the context consensus are then filtered out in a refinement step, and contexts that duplicate findings already captured elsewhere are pruned in a final inter-context pass. The output is a compact visual summary: a short list of representative keyframes, each labeled with a diagnosis.
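A minimal sketch of that aggregation, for a single lesion context, looks like this. The summation and argmax follow the formula above; the refinement rule (keep only frames whose own top prediction matches the consensus) is an assumption made for illustration, since the exact filtering criterion is not spelled out here. The class names and probabilities are invented.

```python
import numpy as np

def converge(frame_probs):
    """Aggregate per-frame class probabilities within one lesion context.

    frame_probs: array of shape (num_frames, num_classes); row t is p_t.
    Returns the context-level label and the indices of consistent frames.
    """
    context_scores = frame_probs.sum(axis=0)   # P = sum of p_t over the context
    label = int(np.argmax(context_scores))     # initial context-level diagnosis
    # Assumed refinement: drop frames whose own argmax contradicts the consensus.
    frame_labels = frame_probs.argmax(axis=1)
    consistent = np.flatnonzero(frame_labels == label)
    return label, consistent

classes = ["normal", "erosion", "angioectasia"]  # illustrative subset
frame_probs = np.array([
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.1],   # blurred frame leaning "normal"
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],   # debris frame leaning "angioectasia"
])

label, consistent = converge(frame_probs)
print(classes[label], consistent)  # consensus is "erosion"; frames 0 and 2 agree
```

In this toy case two of the four frames disagree with each other and with the rest, yet the summed distribution still points clearly at one finding.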

The evaluation is unusually rigorous for this domain. The researchers measure performance at three levels — lesion, keyframe, and patient — using metrics designed to reflect clinical reality. A selected frame only counts as a correct detection if it falls within ±300 seconds of the annotated diagnostic keyframe and carries the right predicted label. Two nearby frames with conflicting labels are both scored as wrong. Redundancy is explicitly penalized.
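The temporal matching rule is easy to make concrete. The sketch below counts a reported finding as detected when some selected frame lies within the tolerance and carries the same label. It is a simplification under stated assumptions: timestamps are in seconds, the function and variable names are invented, and the penalty for conflicting nearby labels is left out because the article does not define how "nearby" is measured.

```python
def lesion_detection_rate(findings, predictions, tolerance_s=300):
    """Fraction of reported findings matched by a correctly labeled nearby frame.

    findings, predictions: lists of (time_in_seconds, label) tuples.
    """
    if not findings:
        return 0.0
    detected = 0
    for f_time, f_label in findings:
        hit = any(
            abs(p_time - f_time) <= tolerance_s and p_label == f_label
            for p_time, p_label in predictions
        )
        detected += int(hit)
    return detected / len(findings)

findings = [(1000, "ulcer"), (9000, "polyp")]
predictions = [(1200, "ulcer"), (9400, "polyp"), (5000, "erosion")]

# The ulcer frame is 200 s away (match); the polyp frame is 400 s away (no match).
print(lesion_detection_rate(findings, predictions))  # 0.5
```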

What They Found

Lesion Detection Rate by Method (Full Training)

Lesion Detection Rate (LDR) measures the proportion of clinically reported findings that are both temporally matched and assigned the correct lesion category. DiCE outperforms all baselines.

Method	LDR (%)
Uniform Sampling (Qwen2-VL)	14.3
Uniform Sampling (InternVL3)	16.1
Keyframe Selection (AKS)	19.8
Keyframe Selection (TSPO)	22.4
DiCE (Ours)	34.7

The headline result is consistent across every metric: DiCE outperforms all baseline methods. The baselines include strong general-purpose long-video models — systems from the QwenVL and InternVL families that represent the current state of the art in video understanding — as well as keyframe selection methods. None of them were designed for the specific sparsity and subtlety of CE video, and it shows.

Redundancy Rate by Method (Full Training)

Redundancy measures the fraction of selected frames that contribute no new lesion information. Lower is better. DiCE produces the most concise diagnostic summaries.

Method	Redundancy (%)
Uniform Sampling (Qwen2-VL)	78.2
Uniform Sampling (InternVL3)	74.6
Keyframe Selection (AKS)	69.3
Keyframe Selection (TSPO)	65.1
DiCE (Ours)	41.8
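One way to read the redundancy definition is sketched below. It assumes each selected frame has already been matched to the reported finding it shows (or to nothing), and it treats a frame as redundant when it matches nothing or repeats a finding an earlier frame already covered. The paper's exact formula may differ; the input format and names here are hypothetical.

```python
def redundancy_rate(matched_findings):
    """matched_findings: for each selected frame, the id of the reported
    finding it matches, or None if it matches no finding."""
    if not matched_findings:
        return 0.0
    seen = set()
    redundant = 0
    for finding_id in matched_findings:
        if finding_id is None or finding_id in seen:
            redundant += 1        # contributes no new lesion information
        else:
            seen.add(finding_id)  # first frame to cover this finding
    return redundant / len(matched_findings)

verbose_summary = [0, 0, None, 1, 1, 1]  # six frames covering two findings
concise_summary = [0, 1]                 # one frame per finding

print(redundancy_rate(verbose_summary))  # 4 of 6 frames are redundant
print(redundancy_rate(concise_summary))  # 0.0
```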

The failure modes of the baselines reveal why the problem is hard. Uniform sampling, which spreads frame selection evenly across the video, simply misses lesions most of the time, because a randomly chosen frame from an 8-hour CE study has roughly a 0.01% chance of being diagnostically meaningful. Keyframe selection methods, which try to pick the most visually "representative" frames, fail in a different way: they end up selecting visually prominent but diagnostically useless images (bubbles, specular highlights, motion-blurred walls) because those dominate the visual landscape.
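The arithmetic behind that figure is worth spelling out. Using the article's own round numbers (about 8 relevant frames in 100,000), a single random frame is relevant with probability 0.008%, which rounds to 0.01%. The sketch below also asks how likely a fixed sampling budget is to hit at least one relevant frame; the budget of 64 frames is a hypothetical value, and modeling the sampler as a random draw without replacement is a simplification.

```python
def prob_single_frame_relevant(total_frames, relevant_frames):
    """Chance that one randomly chosen frame is diagnostically relevant."""
    return relevant_frames / total_frames

def prob_at_least_one_hit(total_frames, relevant_frames, budget):
    """Chance that `budget` frames drawn without replacement include
    at least one relevant frame (hypergeometric complement)."""
    p_miss = 1.0
    for i in range(budget):
        p_miss *= (total_frames - relevant_frames - i) / (total_frames - i)
    return 1.0 - p_miss

N, m = 100_000, 8   # round numbers from the article
k = 64              # hypothetical frame budget

p_single = prob_single_frame_relevant(N, m)
p_hit = prob_at_least_one_hit(N, m, k)
print(f"{p_single:.5%}")  # chance for a single frame
print(f"{p_hit:.3%}")     # chance that the whole budget hits anything
```

Under these assumptions, a 64-frame budget lands on even one relevant frame only about 0.5% of the time, before any question of labeling it correctly.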

(a) Keyframe label timeline for a representative patient. The shaded region highlights a short interval with frequent label changes. Source: Bowen Liu, Li Yang

The label inconsistency analysis is particularly striking. The researchers show that nearby frames in strong baseline methods frequently receive contradictory diagnoses. In a representative patient's timeline, the shaded regions of rapid label switching correspond exactly to real lesion events — moments when the capsule is jostling past something important, the view is unstable, and frame-level models are flipping between "normal," "erosion," and "angioectasia" with each passing second. DiCE's context-level aggregation smooths this out, converting the noisy signal into a confident, consistent judgment. The short-range label inconsistency rate — the fraction of neighboring frame pairs that disagree on diagnosis — is substantially lower for DiCE than for any baseline (Liu et al., 2025).
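The inconsistency measure can be sketched directly from its description. The version below assumes "neighboring" means adjacent frames in the selected sequence; the paper may use a time window instead, and the label sequences are invented for illustration.

```python
def inconsistency_rate(labels):
    """Fraction of adjacent frame pairs whose predicted labels disagree."""
    pairs = list(zip(labels, labels[1:]))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)

# Frame-level predictions flipping around a single lesion event.
frame_level = ["normal", "erosion", "angioectasia", "erosion", "erosion"]
# The same frames after a context-level consensus label is applied.
context_level = ["erosion"] * 5

print(inconsistency_rate(frame_level))    # 0.75
print(inconsistency_rate(context_level))  # 0.0
```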

VideoCAP itself also reveals important structure in the clinical problem.

Figure 2: Dataset statistics. Source: Bowen Liu, Li Yang

The dataset's statistics show that lesion events are genuinely rare: the distribution across the 12 lesion categories is highly skewed, with normal mucosa vastly outnumbering pathological findings. This mirrors real clinical practice and is precisely what makes the task hard for methods that assume diagnostic content is distributed evenly through a video.

Diagnostic Yield by Method (Full Training)

Diagnostic Yield (DY) is the fraction of patients for whom ALL clinically reported findings are successfully detected. It is the most demanding patient-level metric.

Method	DY (%)
Uniform Sampling (Qwen2-VL)	5
Uniform Sampling (InternVL3)	7.5
Keyframe Selection (AKS)	10
Keyframe Selection (TSPO)	12.5
DiCE (Ours)	22.5
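Because diagnostic yield is all-or-nothing per patient, it is stricter than it first looks. A minimal sketch, assuming each patient is represented by a set of reported findings and a set of correctly detected findings (the data layout and the toy patients are hypothetical):

```python
def diagnostic_yield(patients):
    """patients: list of (reported_findings, detected_findings) set pairs.
    A patient counts only if every reported finding was detected."""
    if not patients:
        return 0.0
    complete = sum(reported <= detected for reported, detected in patients)
    return complete / len(patients)

patients = [
    ({"ulcer", "polyp"}, {"ulcer", "polyp", "erosion"}),  # all findings found
    ({"ulcer", "polyp"}, {"ulcer"}),                      # one finding missed
    ({"angioectasia"}, set()),                            # nothing found
    ({"erosion"}, {"erosion"}),                           # all findings found
]

print(diagnostic_yield(patients))  # 0.5
```

If the chart values are computed over the 40 test videos, a step of 2.5 points corresponds to one patient, so 22.5% would mean 9 of 40 patients fully covered and 12.5% would mean 5.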

Why This Changes Things

The practical stakes here are significant. Deployment studies cited in the paper found that when existing frame-level AI is applied to raw CE video, only 8% of the frames it flags as significant actually contain meaningful lesions, and physician review time still exceeds one hour. The automated screening is creating work rather than eliminating it: physicians must wade through a wall of false positives. DiCE's architecture addresses this directly by pruning redundant frames and building specificity into its output.

The conceptual contribution may be even more important than the benchmark numbers. For most of the history of medical image AI, the implicit assumption has been that clinical judgment can be decomposed into isolated image classifications. A model looks at a single frame; it says "polyp" or "not polyp." Accuracy on that binary task has been the metric of progress. What DiCE's authors argue — persuasively — is that this assumption is wrong for CE, and probably for other long-horizon diagnostic tasks too.

Real clinical reasoning is sequential and contextual. A gastroenterologist reading a CE study does not evaluate each frame in isolation. They notice that the capsule has been in one region for an unusually long time; they remember that three minutes ago they saw a suspicious discoloration in what looked like the same segment of bowel; they hold multiple tentative hypotheses simultaneously and update them as new frames arrive. The Evidence Converger is a computational model of exactly this process: it treats the lesion context as the unit of reasoning, not the frame, and it actively seeks consistency across observations before committing to a diagnosis.

This is a meaningful shift in paradigm, and it has implications beyond CE. Colonoscopy videos, bronchoscopy, surgical footage, cardiac catheterization — many diagnostic procedures produce long, sparsely informative video streams that current AI handles poorly for the same reasons. The two-level hierarchical context structure (anatomical context → lesion context) may generalize naturally to any procedure with known anatomical progression.

There is also a data contribution that deserves separate emphasis. VideoCAP is not just a benchmark — it is a model for how medical AI datasets should be built. Prior CE datasets compiled curated image collections that labeled every visible abnormality, whether or not it mattered to the patient's actual diagnosis. The result was a distorted training signal: models learned to recognize visually obvious lesions in clean conditions, not to identify the subset of findings that actually change clinical management. By deriving annotations directly from clinical reports, VideoCAP encodes something more valuable: clinical significance. A finding in VideoCAP is there because a gastroenterologist decided it mattered enough to write into a report. That is a hard-won label.

What's Next

DiCE is not a finished clinical product. The paper itself is careful about what the experiments do and do not prove. The 240 videos come from two clinical centers of a single hospital in Shanghai, and performance on more diverse patient populations — different capsule hardware, different GI pathology prevalence, different clinical documentation practices — remains to be shown. The 12-category taxonomy, while clinically grounded, may not cover the full spectrum of findings encountered in global CE practice.

The matching rule used in evaluation, a ±300 second window around a diagnostic keyframe, is generous by some clinical standards. A tolerance of 5 minutes on either side means a detection could be several intestinal segments away from the actual lesion. Tightening this tolerance would stress-test the system's localization precision, not just its categorical accuracy, and future iterations will need to demonstrate localization at higher resolution.

The Context Weaver's hierarchical grouping is driven by joint temporal-visual affinity encoded in the spatio-temporal tokens, but the grouping algorithm's behavior with unusual capsule trajectories — prolonged gastric retention, rapid transit through the ileum, retrograde motion — is not fully characterized. CE is full of edge cases that confound even human readers; an AI system will need to handle them gracefully before clinical adoption.
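To make "joint temporal-visual affinity" concrete, here is a rough sketch of a spatio-temporal token and a similarity score between two tokens. Several choices are assumptions rather than details from the paper: the fusion operator is read as concatenation, the sinusoidal encoding follows the standard Transformer-style formula, time is measured in seconds, and cosine similarity stands in for whatever affinity the Context Weaver actually uses.

```python
import numpy as np

def sinusoidal_encoding(t, dim=8, max_period=10000.0):
    """Standard sinusoidal encoding of a scalar position t (dim must be even)."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def make_token(visual, t, time_dim=8):
    """Token v_t built from visual features u_t and time encoding e_t.
    Concatenation is an assumed reading of the fusion operator."""
    return np.concatenate([visual, sinusoidal_encoding(t, time_dim)])

def cosine_affinity(a, b):
    """Cosine similarity between two tokens (a stand-in affinity measure)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

u = np.array([1.0, 0.0, 0.0, 0.0])   # identical toy appearance for all frames
token_a = make_token(u, t=10.0)
token_b = make_token(u, t=11.0)      # one second later
token_c = make_token(u, t=5000.0)    # much later in the study

near = cosine_affinity(token_a, token_b)
far = cosine_affinity(token_a, token_c)
print(near > far)  # True: same appearance, but closer in time scores higher
```

A scheme like this also hints at why unusual trajectories are a concern: if the capsule stalls or moves backward, elapsed time stops tracking position in the gut, and a time-based encoding can pull apart frames that show the same tissue.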

Perhaps the most interesting open question is multimodality. Clinical CE reports often integrate findings from the video with patient history, prior studies, and laboratory results. The current DiCE framework operates on video alone. A version that could cross-reference a patient's iron-deficiency anemia or prior Crohn's diagnosis while reading the CE video would be doing something much closer to real diagnostic reasoning — and could, in principle, achieve still lower rates of missed findings.

What DiCE establishes, clearly and rigorously, is that the paradigm matters as much as the model. Throwing a more powerful general-purpose vision-language model at a raw CE video does not solve the problem; structuring the reasoning process to match the clinical task does. For a field that has spent years optimizing frame-level accuracy metrics that poorly predict real-world clinical utility, that reframing is the most important finding of all. The capsule is already doing its job. The question has always been whether the software reading its footage can think at the level of the physician who ordered the test.

DiCE suggests the answer, increasingly, is yes.
