The "Dual-Brain" That Could Make 5G Networks Self-Configuring in Under Half a Second

Somewhere deep in the infrastructure of nearly every modern city, a 5G base station is making thousands of micro-decisions every second — which users get bandwidth, which signals get priority, how to share a finite pool of radio resources among dozens of competing devices. These decisions are increasingly made by AI. But here's the dirty secret: creating that AI, tuning it, writing the code that embeds it in the network, and deploying it safely can take days to weeks of work by multiple specialized engineers. For a technology meant to respond dynamically to changing conditions, that's a remarkable bottleneck.

A new paper by Natanzi, Gajjar, Tang, and Shah (2026) proposes a solution that is at once technically clever and conceptually simple: use a large language model not as the AI running inside the network, but as the AI that builds the AI. The result is a "Dual-Brain" architecture that takes a plain-English instruction from a network operator, such as "predict congestion and reserve bandwidth for cell-edge users," and turns it into a trained, containerized, deployed network-intelligence application, with the orchestration step completing in 384 milliseconds once the system is warm. That is about the duration of an eye blink and, crucially, well under the one-second budget that the relevant network layer permits.

The Science

To understand why this matters, it helps to know a little about how modern cellular networks are governed. The Open Radio Access Network, or O-RAN, is an industry standard that disaggregates a traditional base station into modular components connected via open interfaces (Polese et al., 2023). Crucially, O-RAN introduces two intelligent controllers. The Near-RT RIC, short for Near-Real-Time RAN Intelligent Controller, operates at the network edge and must respond within 10 milliseconds to 1 second. It hosts small AI applications called xApps that collect telemetry and issue control commands in near-real time. Above it sits the Non-RT RIC (Non-Real-Time RIC), which operates on timescales longer than one second and handles policy guidance and application lifecycle management via rApps.

This architecture was designed with AI in mind. But the actual process of provisioning a new AI application has remained stubbornly manual. A network engineer extracts performance metrics. A data scientist trains a model. A software engineer writes the xApp. A deployment engineer rolls it out. Each handoff introduces delay, and the accumulated friction can stretch a deployment across days. The O-RAN standards body defines a conceptual workflow for AI/ML, but leaves the implementation entirely open — a specification gap that the field has not yet filled.

Large language models, the technology behind tools like ChatGPT and Claude, seem like an obvious candidate to close this gap. They are extraordinarily good at understanding natural language, reasoning about systems, and generating code. But they have a fundamental incompatibility with real-time network control: they are slow, probabilistic, and enormous. An 8-billion-parameter LLM requires 100 to 500 milliseconds per inference and occupies 4 to 16 gigabytes of memory. The fastest Near-RT RIC control loops demand decisions in under 10 milliseconds. That is a gap of 10 to 50 times, and no amount of prompt engineering closes it.

The Dual-Brain architecture, prototyped by researchers at Worcester Polytechnic Institute and North Carolina State University on a containerized O-RAN 5G testbed, resolves this tension by assigning each brain exactly the task it is suited for.

Figure 1: The Dual-Brain architecture. The ZTO-Agent (LLM orchestrator, Non-RT RIC rApp) parses intents, curates data, and synthesizes xApp code. NeuralSmith (ML engine) trains lightweight classifiers and returns ONNX models via API. The trained model is injected into a pre-verified xApp template and deployed to the Near-RT RIC. Source: Natanzi et al. (2026).

Brain 1 is the Zero-Touch Orchestration Agent, or ZTO-Agent — an rApp running in the Non-RT RIC, powered in the prototype by Meta's Llama-3.1-8B model. It handles the slow, semantic, reasoning-heavy work: parsing the operator's intent, deciding which network metrics to collect and how to label them, and synthesizing the code that will eventually run in the network. Critically, that code synthesis is not free-form generation. The LLM fills in variables inside pre-verified Jinja2 templates — think of it like a sophisticated mail-merge rather than writing a letter from scratch. This architectural constraint eliminates the most dangerous failure mode of LLM-generated code: hallucinated control logic reaching a live network.
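The mail-merge idea can be sketched in a few lines. This is a minimal illustration, not the authors' code: it uses Python's standard-library string.Template as a stand-in for Jinja2, and the template text, the parameter names (target_group, prb_reserve_pct), the reserve_prbs call, and the allowed ranges are all hypothetical.

```python
from string import Template

# Hypothetical pre-verified xApp fragment. Only the $placeholders vary;
# the control logic itself is fixed and reviewed ahead of time.
XAPP_TEMPLATE = Template(
    "def on_prediction(congested):\n"
    "    if congested:\n"
    "        reserve_prbs(group='$target_group', percent=$prb_reserve_pct)\n"
)

# Hypothetical whitelist of values the LLM is allowed to fill in.
ALLOWED_GROUPS = {"cell_edge", "central"}


def render_xapp(params: dict) -> str:
    """Validate LLM-proposed parameters, then fill the fixed template."""
    if set(params) != {"target_group", "prb_reserve_pct"}:
        raise ValueError("unexpected or missing parameters")
    if params["target_group"] not in ALLOWED_GROUPS:
        raise ValueError("unknown target group")
    pct = params["prb_reserve_pct"]
    if isinstance(pct, bool) or not isinstance(pct, int) or not 0 < pct <= 50:
        raise ValueError("reservation percentage out of range")
    # substitute() raises KeyError if any placeholder is left unfilled.
    return XAPP_TEMPLATE.substitute(params)


rendered = render_xapp({"target_group": "cell_edge", "prb_reserve_pct": 20})
```

Because the LLM only supplies values that pass validation, anything it hallucinates outside the whitelist is rejected before a line of xApp code is produced.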

Brain 2 is NeuralSmith, a dedicated automated machine-learning engine. When the ZTO-Agent has curated and labeled a dataset, it passes it to NeuralSmith via API with a task description and an inference-latency budget. NeuralSmith runs feature engineering, hyperparameter-optimized model selection across several algorithm families (Random Forest, XGBoost, LightGBM, compact neural networks), and five-fold cross-validation, then exports the winning model as an ONNX artifact (a standardized, runtime-portable format) along with a validation report. The whole pipeline is automated. NeuralSmith is developed by All Things Intelligence, a company co-founded by co-author Vijay Shah.

The testbed itself uses Docker-containerized OpenAirInterface components with a software radio-frequency simulator, three simulated user devices generating realistic bursty traffic via iperf3, and FlexRIC providing the E2 interface connectivity that links the intelligent controller to the base station.

What They Found

The researchers demonstrated the full pipeline on a congestion-management scenario. Two central users generate bursty UDP traffic at 20 Mbps each across six on-off cycles over 20 minutes, while a third cell-edge user maintains a low-rate background session. The operator types the intent: "predict congestion and reserve 20% of Physical Resource Blocks for edge users." PRBs — Physical Resource Blocks — are the fundamental unit of radio bandwidth in 4G/5G; reserving them is how the network guarantees service to lower-priority users when the channel gets crowded.

The ZTO-Agent collects MAC-layer telemetry from the base station: per-user PRB allocations and signal-to-noise ratios. It auto-labels time intervals where aggregate cell PRB utilization exceeds 80% capacity as "congested." It dispatches the labeled dataset to NeuralSmith. NeuralSmith trains, validates, and returns a LightGBM classifier — a fast gradient-boosting algorithm — in an ONNX package weighing 49 kilobytes. The model achieves 97.7% accuracy with an F1-score (a measure balancing precision and recall) of 0.975. Its inference latency on a standard CPU is under one millisecond — comfortably inside the 10 ms budget (Natanzi et al., 2026).
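The auto-labeling step described above amounts to a threshold on aggregate utilization. A minimal sketch, assuming per-user PRB counts arrive as one tuple per measurement interval; the function name, the sample values, and the cell size of 106 PRBs are illustrative assumptions, not figures from the paper.

```python
CONGESTION_THRESHOLD = 0.80  # fraction of cell PRBs in use, as stated in the article


def label_intervals(per_user_prbs, total_prbs):
    """Label each interval 1 (congested) or 0 from per-user PRB allocations."""
    labels = []
    for allocations in per_user_prbs:
        utilization = sum(allocations) / total_prbs
        labels.append(1 if utilization > CONGESTION_THRESHOLD else 0)
    return labels


# Three users per interval; 106 PRBs in the cell is an assumed value.
samples = [(40, 45, 5), (10, 12, 5), (50, 40, 6)]
labels = label_intervals(samples, total_prbs=106)
```

The first and third intervals sit at roughly 85% and 91% utilization and are labeled congested; the second, at about 25%, is not.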

LLM vs. ML Classifier: Key Properties for Real-Time Network Control

Comparison of inference latency and model size between an 8B-parameter LLM and the NeuralSmith-trained LightGBM classifier deployed in the Near-RT RIC.

Property	8B-parameter LLM	LightGBM classifier
Inference latency	300 ms	under 1 ms
Model size	10,000 MB	0.049 MB (49 KB)

For comparison, the researchers also tested a simple hand-coded alternative: flag congestion whenever instantaneous PRB utilization exceeds 80%. This threshold rule achieves 84% accuracy — a reasonable result for zero engineering effort. But it reacts only to congestion that has already happened. The LightGBM classifier, operating on temporal features derived from sliding-window aggregations of PRB utilization (mean, standard deviation, minimum, and slope), anticipates congestion onset 2 to 3 measurement intervals before saturation occurs. In a network where conditions can shift in milliseconds, that predictive headroom is operationally significant.
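The four temporal features named above are straightforward to compute. The sketch below is one plausible reading, assuming a fixed-length window over a utilization series and a least-squares slope; the window length and the function name are assumptions, and the paper's exact feature pipeline may differ.

```python
from statistics import mean, pstdev


def window_features(utilization, window=5):
    """Return (mean, std, min, slope) for each full sliding window."""
    if window < 2:
        raise ValueError("window must cover at least two samples")
    x_bar = (window - 1) / 2
    den = sum((i - x_bar) ** 2 for i in range(window))
    features = []
    for end in range(window, len(utilization) + 1):
        w = utilization[end - window:end]
        y_bar = mean(w)
        # Least-squares slope of utilization against sample index.
        num = sum((i - x_bar) * (y - y_bar) for i, y in enumerate(w))
        features.append((y_bar, pstdev(w), min(w), num / den))
    return features


# Utilization climbs steadily but has not yet crossed the 0.80 threshold.
series = [0.30, 0.40, 0.50, 0.60, 0.70, 0.75]
feats = window_features(series, window=5)
```

In this series the threshold rule sees nothing wrong, since every sample is below 0.80, while the positive slope in each window already signals that saturation is approaching. That is the kind of information a classifier can use to predict congestion early.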

Figure 3: ZTO-Agent (Llama-3.1-8B via Ollama) orchestration latency is dominated by intent parsing (552 ms); all other phases (KPM subscription generation, Jinja2 template rendering, and container registration) complete in under 82 ms combined. Total latency is 633 ms cold-start and 384 ms warm-state, well within the Non-RT RIC’s 1 s timing budget. Source: Natanzi et al. (2026).

The orchestration pipeline itself was measured at two operating points. Cold-start (the first invocation, when the LLM is still loading) runs in 633 milliseconds, dominated almost entirely by the intent-parsing phase (552 ms). Once the system is warm, steady-state orchestration completes in 384 milliseconds. Both are well within the Non-RT RIC's one-second budget, and both could be further reduced with dedicated GPU infrastructure (Natanzi et al., 2026).

ZTO-Agent Orchestration Latency Across Four Foundation Models

Warm-state end-to-end orchestration latency for four LLMs tested in the Dual-Brain pipeline. All complete within the Non-RT RIC's 1,000 ms budget. All produce identical deployed xApps.

Model	Warm-state orchestration latency
Qwen-2.5-Coder-7B	354 ms
Llama-3.1-8B	384 ms
Gemma4-26B	527 ms
Llama-3.3-70B	816 ms

One of the paper's more notable findings concerns LLM-agnosticism. The researchers repeated the full pipeline with four foundation models: Llama-3.1-8B (384 ms warm), Qwen-2.5-Coder-7B (354 ms), Gemma4-26B (527 ms), and Llama-3.3-70B (816 ms). All four completed orchestration within the one-second budget. More strikingly, all four produced identical output: the same Jinja2 template parameters, the same E2SM-RC action configuration, the same deployed xApp. The variation in latency, the researchers conclude, reflects unoptimized local serving rather than any structural limitation of the larger models. The architecture does not depend on — or need to be rebuilt around — any particular foundation model.

Figure 4: ZTO-Agent orchestration latency comparison across four foundation models, and the resulting inference-latency gap between LLM and NeuralSmith classifier. (a) All four models complete orchestration well within the Non-RT RIC’s 1 s budget, with Llama-3.1-8B at 384 ms and Llama-3.3-70B at 816 ms. (b) Once trained, the NeuralSmith ONNX classifier delivers inference at sub-millisecond latency, roughly 384× faster than direct LLM inference and well within the Near-RT RIC’s 10 ms control loop budget. Source: Natanzi et al. (2026).

Why This Changes Things

The practical implication is a compression of the AI-deployment lifecycle by potentially orders of magnitude. What currently requires a multi-person team and days of coordination could, in principle, be initiated by a network operator typing a sentence. The system handles data collection, labeling, model selection, code synthesis, containerization, and deployment — automatically, end-to-end.

Two design choices in this architecture deserve particular attention, because they address failure modes that have haunted previous attempts to use LLMs in operational systems.

The first is the template-constrained code synthesis. Earlier in their development, the researchers tried using the LLM for both orchestration and inference, prompting it to classify congestion states directly from raw telemetry. The results were bad: unacceptable latency, inconsistent outputs across identical inputs, and occasional hallucination of congestion states that did not exist. Handing inference to a dedicated classifier and confining the LLM to pre-verified templates addressed all three problems. The LLM's role is now limited to filling in parameters within a known-safe structure, not inventing control logic. Every xApp the pipeline produces inherits the template's tested correctness, which is a structural guarantee rather than a probabilistic one.

The second is the latency-budget constraint passed to NeuralSmith. Without it, an automated ML engine might select an accurate but slow model — accurate enough to appear successful in testing but too slow for real deployment. By specifying the inference constraint upfront, the architecture ensures that correctness and speed are jointly optimized, not traded off silently.
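One simple way to honor such a budget is to filter candidates by latency first and only then rank by accuracy. The sketch below illustrates that idea; it is not NeuralSmith's actual selection procedure, and the candidate names and numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    f1: float          # cross-validated F1-score
    latency_ms: float  # measured inference latency per sample


def select_model(candidates, latency_budget_ms):
    """Pick the highest-F1 candidate whose latency fits the budget."""
    feasible = [c for c in candidates if c.latency_ms <= latency_budget_ms]
    if not feasible:
        raise ValueError("no candidate meets the latency budget")
    return max(feasible, key=lambda c: c.f1)


# Illustrative numbers only; none are measurements from the paper.
pool = [
    Candidate("large_neural_net", f1=0.98, latency_ms=25.0),
    Candidate("gradient_boosting", f1=0.97, latency_ms=0.8),
    Candidate("random_forest", f1=0.95, latency_ms=3.0),
]
best = select_model(pool, latency_budget_ms=10.0)
```

Here the most accurate candidate is excluded because it cannot meet a 10 ms budget, so the selection falls to the next-best model that can. Without the filter, the slower model would have won on accuracy alone.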

The size differential between the two brains is also worth sitting with for a moment. The LLM orchestrator occupies 4 to 16 gigabytes and takes hundreds of milliseconds to reason. The classifier it produces is 49 kilobytes and responds in under a millisecond. That is roughly an 80,000- to 330,000-fold difference in size, depending on which end of the memory range applies, and a difference in speed of several hundred times (the paper reports roughly 384×). The insight that a large, slow, general model can generate a small, fast, specialized one is not unique to this paper, but this work makes it concrete in a domain where the speed constraints are unusually strict and the consequences of failure are unusually tangible.
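The size ratio follows directly from the figures quoted above. A quick check, assuming decimal units (1 KB = 1,000 bytes); binary units would shift the results by a few percent.

```python
KB = 1000
GB = 1000 ** 3

classifier_bytes = 49 * KB
llm_bytes_low, llm_bytes_high = 4 * GB, 16 * GB

# Ratio of LLM footprint to classifier footprint at both ends of the range.
size_ratio_low = llm_bytes_low / classifier_bytes
size_ratio_high = llm_bytes_high / classifier_bytes

# Lower bounds on the speed ratio: the classifier latency is only known
# to be under 1 ms, so the true ratio is at least this large.
classifier_latency_ms = 1.0
speed_ratio_low = 100 / classifier_latency_ms
speed_ratio_high = 500 / classifier_latency_ms
```

The 4 GB end of the range gives a ratio of about 82,000 and the 16 GB end about 327,000, so the size gap spans roughly five orders of magnitude either way.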

There is also a data-sovereignty dimension. Network operators handle sensitive traffic metadata, and regulatory requirements often prohibit that data from leaving the operator's infrastructure. The Dual-Brain architecture supports fully on-premise deployment: both the ZTO-Agent and NeuralSmith can run on the same compute cluster, keeping every byte of telemetry within the operator's trust boundary. This is not an afterthought — it is called out explicitly as a design requirement, and it matters for whether any architecture like this can actually be adopted by telecoms operating under European data-protection law or similar regimes.

What's Next

The paper is candid about its limitations. The prototype runs on a software-simulated RF channel — useful for isolating orchestration behavior and ensuring reproducibility, but silent on how the trained classifiers would hold up against real-world fading, interference, and hardware impairments. The researchers flag this as an open problem: techniques like domain randomization and online adaptation will likely be needed to make NeuralSmith-trained models robust across the messy diversity of real deployments.

The labeling strategy also needs work. Auto-labeling congestion above an 80% PRB threshold worked cleanly in the controlled scenario, but the threshold is sensitive and arguably arbitrary. More nuanced tasks — interference management, handover optimization, energy saving — will likely require adaptive labeling or multi-threshold strategies that the ZTO-Agent's current prompting cannot reliably generate.

Safety at scale is the deepest open question. When multiple autonomously provisioned xApps coexist in the same Near-RT RIC, their control actions can conflict: a throughput-boosting xApp competing with an energy-saving xApp for the same PRB pool, for instance. The paper advocates for pre-deployment conflict detection — possibly through static analysis of rendered templates — and for explainability reports that let operators understand what features the classifier relied on and what decisions the LLM made during synthesis. Without this transparency, regulators and operators will be reluctant to allow fully autonomous provisioning in production networks.
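What a pre-deployment check might look like is still an open design question. As a purely hypothetical sketch, suppose each rendered template can be summarized by the shared resource it acts on and the share it asks to reserve; a pairwise check could then flag combinations that oversubscribe that resource. The XAppSpec fields and the xApp names below are assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class XAppSpec:
    """Hypothetical summary extracted from a rendered xApp template."""
    name: str
    resource: str     # shared resource the xApp acts on, e.g. "prb_pool"
    reserve_pct: int  # share of that resource the xApp asks to reserve


def find_conflicts(specs, capacity_pct=100):
    """Flag pairs of xApps whose reservations on one resource exceed capacity."""
    conflicts = []
    for a, b in combinations(specs, 2):
        if a.resource == b.resource and a.reserve_pct + b.reserve_pct > capacity_pct:
            conflicts.append((a.name, b.name))
    return conflicts


specs = [
    XAppSpec("edge_guard", "prb_pool", 20),
    XAppSpec("throughput_boost", "prb_pool", 90),
    XAppSpec("energy_saver", "carrier_power", 30),
]
conflicts = find_conflicts(specs)
```

A pairwise check like this catches only the simplest conflicts. Interactions among three or more xApps, or conflicts that depend on timing, would need a richer analysis.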

There is also the question of how confident the orchestrator should be in its own outputs. LLMs are famously prone to producing overconfident answers under ambiguous conditions. Recent work on conformal prediction — a statistical framework for attaching rigorous reliability guarantees to probabilistic outputs — applied to LLM-based control (Farzaneh et al., 2026) points toward one principled solution: the system could flag low-confidence intent interpretations for human review rather than proceeding automatically. This would not eliminate autonomy but would calibrate it, creating a dial between full automation and operator-in-the-loop oversight.
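To make the idea concrete, here is a generic split-conformal sketch, not the method of Farzaneh et al. It assumes the orchestrator can assign a probability to each candidate interpretation of an intent and that a calibration set of past intents with known correct interpretations is available. The intent names and calibration scores are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha=0.1):
    """Split-conformal quantile of nonconformity scores (1 - p_true)."""
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(calibration_scores)[k - 1]


def prediction_set(intent_probs, threshold):
    """All interpretations whose nonconformity score is within the threshold."""
    return {name for name, p in intent_probs.items() if 1 - p <= threshold}


def needs_human_review(intent_probs, threshold):
    """Escalate unless exactly one interpretation survives."""
    return len(prediction_set(intent_probs, threshold)) != 1


# Hypothetical calibration scores: 1 minus the probability given to the
# correct interpretation on 20 held-out intents.
cal = [0.02, 0.03, 0.05, 0.06, 0.08, 0.09, 0.10, 0.12, 0.15, 0.18,
       0.20, 0.22, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.60, 0.80]
q = conformal_threshold(cal, alpha=0.1)

clear = {"reserve_prbs": 0.93, "energy_saving": 0.05, "handover": 0.02}
ambiguous = {"reserve_prbs": 0.50, "energy_saving": 0.45, "handover": 0.05}
```

With these numbers the clear intent yields a single-element prediction set and proceeds automatically, while the ambiguous one yields two plausible interpretations and is routed to an operator. Lowering alpha widens the sets and sends more intents to review, which is the dial between automation and oversight described above.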

Finally, as this architecture scales toward 6G networks, where the vision is of AI-RAN systems that reason about themselves at every layer, the computational demands on the Non-RT RIC will grow. The 7B- and 8B-parameter models tested here are accessible on modern GPU servers, but the smallest cell sites and rural deployments will need lighter, potentially telecom-specialized language models. Research on parameter-efficient fine-tuning and speculative decoding tuned to the structured, repetitive nature of orchestration prompts could extend the architecture to hardware where a full LLM is not feasible.

What the Dual-Brain paper ultimately demonstrates is a design philosophy as much as a system: stop asking whether LLMs should run networks, and start asking what kind of thinking LLMs are actually good at. Reasoning about intent, translating between human language and machine specifications, synthesizing code within guardrails — these are genuinely LLM-native tasks. Sub-millisecond inference, deterministic output, edge-deployable scale — these are not. Matching the tool to the task, rather than forcing one tool into every role, is a principle that sounds obvious but has been conspicuously absent from much of the hype around AI in telecommunications. This architecture makes it concrete, and makes it work.
