The AI That Reads the Whole Story: A Smarter Way to Extract Events from News

When a cyclone tears through a coastal region, or an outbreak of disease begins spreading through a city, the first structured accounts of what happened appear in news reports. Before any government dashboard updates, before any official briefing, journalists are already answering the five questions that matter most: who was involved, what happened, where, when, and why. For emergency responders, policy analysts, and disaster management systems, being able to automatically extract those answers — at scale, from thousands of documents — is not a convenience. It could be the difference between a coordinated response and a chaotic one.

Those are the real-world stakes behind a new system called MODEE (Multimodal Open-Domain Event Extraction), developed by Praval Sharma at the University of Nebraska Omaha (Sharma, 2026). The system achieves something that has consistently eluded AI researchers: accurately pulling structured event information from documents covering any type of event, without being told in advance what kinds of events to look for. And it does so by combining two very different ways of understanding text: reading it as language, and reading it as a network of relationships.

The Science

To understand why this matters, it helps to know how the field has been stuck. Event extraction in AI generally falls into two camps. Closed-domain systems are trained with a fixed menu of event types — "Conflict-Attack," "Natural Disaster," "Corporate Merger" — along with predefined templates for the arguments each type requires. They work well within their lane but fail completely outside it. Open-domain systems, by contrast, attempt to handle any event type — but they've historically relied on handcrafted linguistic rules, and they break down when events are described in unexpected ways.

The obvious solution seems to be large language models (LLMs) — the same technology underlying ChatGPT and its cousins. LLMs have transformed nearly every corner of natural language processing. Yet for this specific task, they have a structural weakness. When a document is long — spanning multiple paragraphs and sentences, as news reports do — LLMs tend to lose track of information buried in the middle. This is called the lost-in-the-middle phenomenon: the model pays strong attention to the beginning and end of a document, but its attention dilutes for content in between. Events, which are typically described across multiple sentences, suffer directly from this limitation.

MODEE's insight is to treat this not as a language problem alone, but as a structure problem. Rather than feeding a document only as a sequence of words, Sharma's approach also converts it into a graph — a mathematical object where every word (or token) is a node, and every pair of nodes is connected by an edge. This complete graph, covering all token-to-token relationships across the entire document, is then processed by a graph neural network (GNN) — a type of AI architecture designed to learn from connected structures, like social networks or molecular bonds.
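As a minimal sketch of the complete token graph described above, the snippet below builds one node per token and one undirected edge per pair of tokens. The function name `complete_token_graph`, the whitespace-style token list, and the toy sentence are illustrative assumptions, not details taken from the paper.

```python
from itertools import combinations


def complete_token_graph(tokens):
    """Return (nodes, edges) for a complete undirected graph over token positions.

    Hypothetical helper for illustration: nodes are token indices, and every
    unordered pair of distinct indices becomes one edge.
    """
    nodes = list(range(len(tokens)))
    edges = list(combinations(nodes, 2))
    return nodes, edges


# Toy input (assumed for illustration only).
tokens = ["Cyclone", "hits", "coastal", "town"]
nodes, edges = complete_token_graph(tokens)
print(len(nodes), "nodes,", len(edges), "edges")
print(edges)
```

Because every token is linked to every other token, a word near the start of a document is directly connected to a word near the end, which is the property the article credits with preserving document-level structure.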

Figure 1: Overview of event extraction in MODEE. Source: Praval Sharma

The resulting system has four components working in concert (Figure 1). A text encoder (based on the T5 language model) reads the document as text and produces contextual embeddings, numerical representations of what each word means in context. A graph encoder (based on GraphSAGE, a graph neural network) processes the document's token graph and produces structural embeddings, representations of how each word relates to every other word across the document. An attention-based gated fusion module then combines these two streams, computing a score for each token that reflects how relevant it is to the event being described. Finally, a text decoder reads those fused, relevance-weighted representations and generates the five answers (where, when, what, who, and why) as plain text.
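To make the graph encoder step more concrete, here is a rough sketch of a single GraphSAGE-style layer with a mean aggregator, applied to a complete graph. The article only says the encoder is based on GraphSAGE, so the choice of mean aggregation, the ReLU activation, the weight names `W_self` and `W_neigh`, and the toy features are all assumptions.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def sage_mean_layer(features, W_self, W_neigh):
    """One GraphSAGE-style layer on a complete graph (assumed variant).

    Each node combines its own features with the mean of all other nodes'
    features, then applies ReLU.
    """
    updated = []
    for i, h in enumerate(features):
        neighbours = features[:i] + features[i + 1:]  # complete graph: everyone else
        aggregated = mean_vector(neighbours)
        z = [a + b for a, b in zip(matvec(W_self, h), matvec(W_neigh, aggregated))]
        updated.append([max(0.0, value) for value in z])
    return updated


# Toy 2-dimensional features for three tokens, with identity weights.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
structural = sage_mean_layer(features, identity, identity)
print(structural)
```

In a trained model the weights would be learned and several layers would typically be stacked; the sketch only shows how each token's representation comes to depend on every other token in the document.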

The fusion step is subtle but important. Rather than simply averaging the two sets of embeddings, MODEE uses a gating mechanism inspired by additive attention (first developed by Bahdanau et al., 2015 for machine translation). It computes a token-by-token score — a number between 0 and 1 for each word — that says, in effect: "how much does this word matter for describing the event?" Words with high scores get amplified in the final representation. Words that are irrelevant — background context, boilerplate phrasing — get down-weighted. This is what allows the system to focus accurately even in long documents where a naive language model would lose the thread.
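The gating idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the paper's formulation: the score follows the general additive-attention pattern (a tanh over projected inputs, then a sigmoid), and the fused vector is taken to be the gate times the sum of the two embeddings. The parameter names `W_t`, `W_g`, and `v`, the dimensions, and the random inputs are all hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(text_emb, graph_emb, W_t, W_g, v):
    """Score each token in (0, 1) and scale its combined embedding by that score."""
    fused, gates = [], []
    for h, g in zip(text_emb, graph_emb):
        # Additive-attention style hidden state: tanh(W_t h + W_g g).
        hidden = [math.tanh(a + b) for a, b in zip(matvec(W_t, h), matvec(W_g, g))]
        gate = sigmoid(sum(vi * hi for vi, hi in zip(v, hidden)))
        gates.append(gate)
        # Assumed combination rule: gate times the sum of both embeddings.
        fused.append([gate * (hi + gi) for hi, gi in zip(h, g)])
    return fused, gates


def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
dim, hidden_dim, n_tokens = 4, 3, 5
text_emb = random_matrix(n_tokens, dim)    # stand-in for text encoder output
graph_emb = random_matrix(n_tokens, dim)   # stand-in for graph encoder output
W_t, W_g = random_matrix(hidden_dim, dim), random_matrix(hidden_dim, dim)
v = [random.uniform(-1, 1) for _ in range(hidden_dim)]

fused, gates = gated_fusion(text_emb, graph_emb, W_t, W_g, v)
print([round(gate, 3) for gate in gates])
```

The sigmoid keeps every score strictly between 0 and 1, so a token judged irrelevant has its representation shrunk toward zero before the decoder sees it.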

Training MODEE required building a new dataset. The researchers assembled 10,000 news reports published between 2015 and 2019 from seven Indian newspapers spanning different regions and journalistic styles — including the Times of India, The Hindu, Economic Times, and Kashmir Observer. Three trained annotators labeled each report with the 5Ws of its main event, going through multiple rounds of calibration to reach an inter-coder reliability above 0.8 (measured using Krippendorff's alpha, a standard measure of annotation agreement). The dataset was split into 8,000 training, 1,000 validation, and 1,000 test documents.

What They Found

The headline result is striking: MODEE-Base, built on the modestly sized T5-Base language model, outperforms every baseline tested — including systems based on models orders of magnitude larger (Sharma, 2026).

Event Extraction Performance: MODEE vs. Baselines (F1, Exact Match)

F1 scores under exact match evaluation on the test set. MODEE-Base achieves the highest score, outperforming fine-tuned T5 models of all sizes and large LLMs under prompting. Systems shown: MODEE-Base, MODEE-Small, T5-Base (fine-tuned), T5-Small (fine-tuned), T5-Large (fine-tuned), Llama 3.1 70B (5-shot), Qwen 3 32B (5-shot), and Giveme5W1H.

Llama 3.1 at 70 billion parameters — one of the most powerful openly available language models — failed to match MODEE's performance under either zero-shot or five-shot prompting. (Zero-shot means the model is given only instructions; five-shot means it's shown five examples first.) Qwen 3 at 32 billion parameters performed similarly. Mistral V0.3 at 7 billion parameters also fell short. And Giveme5W1H, the leading rule-based open-domain system, was the worst performer of all — illustrating just how fragile heuristic approaches are in the face of diverse real-world language.

Importantly, even fine-tuned versions of T5-Large — a model larger than MODEE-Base, trained directly on the task — did not beat MODEE-Base. This is the key empirical finding. It means the performance gain doesn't come from having more parameters or more training data. It comes from the architectural decision to explicitly model document-level structure through graph-based learning. As Sharma writes: "Using a larger model does not compensate for the lack of document-level context, structure, and semantics of event-related tokens that MODEE captures through multimodal integration."

MODEE-Small (built on the smaller T5-Small) also outperformed its counterpart T5-Small, reinforcing that the benefit of graph integration holds regardless of model scale. The pattern is consistent: adding the graph modality helps, and it helps more than simply scaling up the language model.

MODEE-Base vs. T5-Base: Metric Comparison

Radar comparison across Exact Match F1, ROUGE-L F1, and BERTScore F1 for MODEE-Base versus the equivalent-architecture T5-Base fine-tuned baseline.

The system was evaluated on three metrics. Exact match (EM) checks whether the extracted text precisely matches the gold-standard annotation — a strict test. ROUGE-L measures the longest overlapping sequence of words between the prediction and the reference, capturing partial credit for near-correct answers. BERTScore uses a language model to assess semantic similarity, catching cases where the system says "Mumbai" instead of "Bombay" and should still receive credit.
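The difference between the strict and the partial-credit metrics is easier to see with a small worked example. The sketch below implements exact match and a balanced ROUGE-L F1 based on the longest common subsequence of words. The lowercasing and whitespace tokenization are simplifying assumptions, the paper's exact normalization is not described in this article, and the example strings are invented.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """Balanced F1 over LCS-based precision and recall (assumed normalization)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def exact_match(prediction, reference):
    """True only if the two strings agree after trimming and lowercasing."""
    return prediction.strip().lower() == reference.strip().lower()


# Invented example: the prediction omits two words from the reference.
prediction = "heavy rain flooded Mumbai"
reference = "heavy rain flooded parts of Mumbai"
em = exact_match(prediction, reference)
rl = rouge_l_f1(prediction, reference)
print(em, round(rl, 3))
```

Here the prediction fails exact match outright, yet all four of its words appear in order in the six-word reference, so ROUGE-L still awards substantial partial credit.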

MODEE-Base achieved the best scores on all three metrics, in precision as well as recall, meaning it both finds the right information and avoids including wrong information.

The researchers also tested MODEE on a closed-domain document-level event extraction benchmark, even though it was designed for open-domain use. It outperformed existing closed-domain approaches there too (Sharma, 2026). This generalization is practically significant: a single system that works across both constrained and unconstrained settings reduces the cost and complexity of deployment.

Dataset Composition: Training, Validation & Test Split

The 10,000-document annotated dataset of Indian news reports is split across training, validation, and test sets used for MODEE's development and evaluation.

Split	Documents
Training Set	8,000
Validation Set	1,000
Test Set	1,000

Figure 2: Example of a document from the dataset with the 5Ws of the main event annotated (highlighted). Source: Praval Sharma

Why This Changes Things

To appreciate the broader significance, consider what event extraction enables. Knowledge graphs — the structured databases that power much of modern AI reasoning — are built from extracted entities and relationships. Document clustering and summarization systems need to know what events documents are about before they can group or condense them. Decision support tools for emergency management need to rapidly synthesize reports from dozens of sources into actionable intelligence. All of these downstream applications are only as good as the event extraction layer underneath them.

Current approaches to that layer are brittle in a specific, costly way. Closed-domain systems require experts to predefine schemas for every event type they'll encounter — a labor-intensive process that inevitably lags behind the diversity of real-world events. When something genuinely novel happens (a new type of cyberattack, a previously unknown disease pathway), closed-domain systems simply don't know what to look for. Open-domain systems were supposed to solve this, but their reliance on rules made them equally fragile in the face of linguistic variation.

MODEE represents a different path: a system that learns, from data, to identify the structural and semantic signatures of event descriptions in text — signatures that generalize across event types because they reflect the underlying logic of how events get reported, not the surface-level vocabulary of any particular domain.

There's also something methodologically important about how MODEE handles "multimodality." In most multimodal AI research, modalities mean genuinely separate data sources: text and images, or text and audio. Those approaches require parallel annotations across sources, which is expensive and hard to scale. MODEE derives its second modality — the graph — from the same text that provides the first modality. This is a form of self-derived multimodality: enriching a single source of information by representing it in two structurally different ways simultaneously. The resulting system needs only annotated text documents to train, not elaborate paired datasets.

This has significant implications for low-resource settings. Building event extraction systems for languages or domains where parallel image-text datasets don't exist has historically been prohibitively expensive. MODEE's architecture sidesteps that constraint entirely.

The choice of Indian newspapers as the training corpus is also notable. Indian English spans an enormous range of journalistic styles, regional focuses, and cultural contexts — from national business reporting to state-level political coverage to local disaster news. A system trained on this diversity and achieving high performance on it is likely to generalize well to other varied corpora.

What's Next

MODEE has real limitations that the paper is transparent about. The system follows a one-event-per-document setting — it extracts the single main event described in a document, not all events. Real documents, especially long-form journalism or government reports, contain multiple events nested within each other. Extending the architecture to handle multi-event extraction is a natural and important next step.

The dataset, while large by the standards of this subfield, is drawn from a specific five-year window (2015–2019) and a specific national context (India). How well MODEE generalizes to other languages, other national contexts, or more recent events is an empirical question that remains open.

The complete graph construction — connecting every token to every other token — is computationally expensive for very long documents. For a 512-token document, that means roughly 131,000 potential edges. The researchers cap document length at 512 tokens partly for this reason. As documents grow longer, more efficient graph construction strategies will be needed.
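The edge count follows from the number of unordered token pairs, n(n - 1)/2, which grows quadratically with document length. The short sketch below checks the figure for 512 tokens and shows how quickly it grows; the function name is illustrative, and the 1,024-token case is a hypothetical length beyond the cap mentioned above.

```python
def complete_graph_edges(n_tokens, directed=False):
    """Number of edges in a complete graph with no self-loops."""
    ordered_pairs = n_tokens * (n_tokens - 1)
    return ordered_pairs if directed else ordered_pairs // 2


for n in (128, 256, 512, 1024):
    print(n, complete_graph_edges(n))
```

For 512 tokens this gives 130,816 undirected edges, the "roughly 131,000" quoted above, and doubling the length to 1,024 tokens roughly quadruples the count.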

There are also questions about what the graph is actually capturing. The paper demonstrates empirically that the graph helps — but the precise structural or semantic patterns that GraphSAGE learns to encode, and why they help the fusion module compute better attention scores, remain partially opaque. Interpretability work on the graph encoder's learned representations could be illuminating.

Finally, MODEE currently extracts the 5Ws as free-text strings, evaluated against annotated answers. Integrating this output into downstream knowledge bases — where precise, canonicalized entities and relations are needed — would require additional normalization steps. The pipeline from MODEE's outputs to a structured knowledge graph is not yet closed.

What MODEE opens up is significant nonetheless. It demonstrates convincingly that the lost-in-the-middle problem plaguing LLMs on long documents is not a fundamental limitation of neural approaches to event extraction — it's a limitation of using only one modality. When you give the model a second way of seeing the document, a way that preserves global structure rather than sequential attention, performance jumps in ways that raw model scale cannot match.

For the researchers building the next generation of emergency response systems, news intelligence platforms, and automated knowledge graph construction pipelines, that's a result worth paying close attention to.
