The Hidden Tax Killing Your AI Agent — and the Fix That Costs 95% Less
Every time your AI agent calls a tool, it silently re-reads the instruction manual for every tool it will never use — and that waste is breaking large-scale AI
120 tools, 6 servers: Tool Attention cuts per-turn schema tokens from 47k to 2.4k — a 95% reduction.
Somewhere inside your AI agent's context window, a quiet crisis is unfolding. Every time the agent takes a conversational turn — answering a question, calling a function, planning a next step — it first reads a small novel's worth of instruction manuals for tools it will never touch. A GitHub integration with 93 tools dumps roughly 55,000 tokens of schema definitions into the prompt. An enterprise database connector with 106 tools adds another 54,600. These tokens don't disappear between turns. They come back. Every. Single. Turn.
This is the Tools Tax, and according to Sadani and Kumar (2026), it isn't just expensive — it's silently breaking the reasoning capabilities of the very agents we're building to do complex work.
The Science
The paper sits at the intersection of AI systems engineering and context-window economics. Sadani and Kumar formalize a problem that practitioners have been complaining about in blog posts and community forums for months, give it rigorous mathematical structure, and propose a concrete solution they've implemented and released as open-source code.
Their target is the Model Context Protocol (MCP), the open standard that Anthropic introduced in November 2024 and that OpenAI, Google, and Microsoft have since adopted. MCP is the plumbing that lets AI agents talk to external tools: databases, filesystems, code repositories, Slack, web search, and more. It's elegant in concept. Rather than every AI company building custom integrations with every tool company (a nightmare), MCP creates a standardized handshake so any compliant agent can discover and call any compliant tool. The result: a much tidier world.
The catch is structural. MCP inherits the statelessness of the chat APIs it runs on top of. There's no persistent memory between turns. So every time the agent calls the API, it must re-send the entire tool catalog — name, description, full JSON parameter schema, output spec — for every single connected tool. Whether or not those tools are remotely relevant. Whether or not anything has changed since the last turn. The protocol has no concept of "you already know about these tools; here's just what's new."
The evaluation is a calibrated simulation, not a live agent benchmark, a distinction the authors flag repeatedly and honestly. They constructed a synthetic benchmark of 120 tools across six servers, calibrating per-server token counts to match public audits of real MCP deployments (see the table below). They then measured the exact tokenized prompts that each approach would send to an LLM, using OpenAI's tiktoken tokenizer. End-to-end metrics like task success and cost are explicitly labeled "projected": derived from measured token counts combined with published deployment telemetry, not from running live agents. This transparency is notable, and it matters for interpreting the results.
Per-Server MCP Tools Tax: Token Cost Per Turn
Token overhead per conversational turn for common MCP server types, calibrated to public deployment audits. The GitHub full suite and Enterprise DB dominate overwhelmingly.
| Server | Tokens per turn |
|---|---|
| Filesystem | 1,500 |
| Web Search | 1,200 |
| Slack | 2,000 |
| Database | 2,500 |
| Git | 3,000 |
| GitHub (full, 93 tools) | 55,000 |
| Enterprise DB (106 tools) | 54,600 |
What They Found
The core measurement is striking. In a 120-tool, six-server deployment, naive MCP schema injection consumes 47,300 tokens per turn on tool definitions alone: before the user has said a word, before any reasoning has happened, before any results have come back. That's roughly 24% of a 200,000-token context window gone before the work begins (see the table below).
Tool Attention vs. Baselines: Per-Turn Tool Tokens
Measured per-turn tool token consumption across approaches on the 120-tool, 6-server benchmark. Tool Attention reduces the overhead from 47,300 tokens to 2,400, a 95% reduction.
| Approach | Tool tokens per turn |
|---|---|
| Full Schema (baseline) | 47,300 |
| Static Pruning | 28,000 |
| Simple Retrieval (top-k full) | 9,800 |
| CLI Lazy Discovery | 6,200 |
| Tool Attention (this work) | 2,400 |
Tool Attention brings that figure down to 2,400 tokens per turn — a 95% reduction. Effective context utilization (the fraction of the context window actually available for task-relevant content) rises from 24% to 91%.
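The headline percentages follow directly from the measured token counts. The short check below redoes that arithmetic; the constant and function names are illustrative, and the 200,000-token window is the one used in the article's example.

```python
# Per-turn tool-token figures quoted in the article.
FULL_SCHEMA_TOKENS = 47_300
TOOL_ATTENTION_TOKENS = 2_400
CONTEXT_WINDOW = 200_000  # window size used in the article's example


def reduction(before: int, after: int) -> float:
    """Fractional reduction when going from `before` tokens to `after` tokens."""
    return (before - after) / before


def window_share(tokens: int, window: int = CONTEXT_WINDOW) -> float:
    """Fraction of the context window occupied by `tokens`."""
    return tokens / window


print(f"reduction: {reduction(FULL_SCHEMA_TOKENS, TOOL_ATTENTION_TOKENS):.1%}")
print(f"baseline share of window: {window_share(FULL_SCHEMA_TOKENS):.2%}")
print(f"Tool Attention share of window: {window_share(TOOL_ATTENTION_TOKENS):.2%}")
```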
Those numbers come from a mechanism with three interlocking parts.
First, an Intent–Schema Overlap (ISO) score. Before each turn, the system encodes the user's current message as a vector using a compact sentence-embedding model (a 22-million-parameter model called MiniLM-L6, which runs in about 30–60 milliseconds on a standard CPU). It then computes the cosine similarity — a measure of directional alignment in high-dimensional space — between that query vector and pre-computed embeddings of every tool's summary. Tools that don't semantically match the query get a low score. Formally:
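$$\mathrm{ISO}(q, t) = \cos(\mathbf{e}_q, \mathbf{e}_t) = \frac{\mathbf{e}_q \cdot \mathbf{e}_t}{\lVert \mathbf{e}_q \rVert \, \lVert \mathbf{e}_t \rVert}$$
where $\mathbf{e}_q$ is the query embedding and $\mathbf{e}_t$ is the tool summary embedding.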
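As a minimal sketch of the scoring step, the snippet below computes cosine similarity in plain Python. It is not the authors' implementation: the tool names and the toy 3-dimensional vectors are hypothetical stand-ins for real sentence embeddings, which in practice would come from an embedding model such as the MiniLM-L6 model mentioned above.

```python
import math
from typing import Sequence


def iso_score(query_vec: Sequence[float], tool_vec: Sequence[float]) -> float:
    """Cosine similarity between a query embedding and a tool-summary embedding."""
    if len(query_vec) != len(tool_vec):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(q * t for q, t in zip(query_vec, tool_vec))
    norm_q = math.sqrt(sum(q * q for q in query_vec))
    norm_t = math.sqrt(sum(t * t for t in tool_vec))
    if norm_q == 0.0 or norm_t == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_q * norm_t)


# Hypothetical toy vectors; real embeddings have hundreds of dimensions.
query = [1.0, 0.0, 1.0]
tool_summaries = {
    "search_issues": [1.0, 0.0, 1.0],       # same direction as the query
    "send_slack_message": [0.0, 1.0, 0.0],  # orthogonal to the query
}
scores = {name: iso_score(query, vec) for name, vec in tool_summaries.items()}
```

A tool pointing in the same direction as the query scores close to 1, while an orthogonal one scores 0 and would fall below any reasonable threshold.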
Second, a stateful gating function. ISO scoring is powerful but not sufficient — some tools should be excluded not because they're irrelevant but because the agent doesn't yet have the right permissions, or because a prerequisite step hasn't been completed. The gating function combines the ISO threshold with explicit preconditions: "only offer this database write tool after the user has confirmed the plan," or "only offer GitHub push after authentication." This deterministic layer cannot be circumvented by a cleverly worded query, because it checks authoritative agent state, not free text.
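One way to picture the gate is as a conjunction: a tool is offered only if its ISO score clears the threshold and every precondition holds against the agent's own state. The sketch below assumes that shape; the `Tool` class, the `plan_confirmed` state key, and the 0.28 default threshold are hypothetical and not taken from the paper's code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

AgentState = Dict[str, bool]


@dataclass
class Tool:
    name: str
    # Each precondition inspects authoritative agent state, never the user's text.
    preconditions: List[Callable[[AgentState], bool]] = field(default_factory=list)


def gate(tool: Tool, iso: float, state: AgentState, threshold: float = 0.28) -> bool:
    """Offer the tool only if it is relevant AND all preconditions hold."""
    relevant = iso >= threshold
    allowed = all(check(state) for check in tool.preconditions)
    return relevant and allowed


# Hypothetical example: a write tool that requires a confirmed plan.
db_write = Tool("db_write", [lambda s: s.get("plan_confirmed", False)])

blocked = gate(db_write, iso=0.9, state={})                        # relevant, not permitted
offered = gate(db_write, iso=0.9, state={"plan_confirmed": True})  # relevant and permitted
```

Because the precondition reads `state` rather than the query, a high similarity score alone cannot unlock the tool.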
Third, two-phase lazy schema loading. Even after gating, there's a question of what to actually send to the model. Tool Attention keeps a compact summary pool — a single-sentence description of every tool, about 40 tokens each — always resident in context. For 120 tools, that's roughly 4,800 tokens: static, cacheable, and enough for the model to know that tools exist. Only for the small active set selected by gating does the system load and inject the full JSON schema. In the benchmark, this active set averages about 6 tools per turn.
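The two phases can be sketched as a prompt builder that always emits the one-line summaries and adds full schemas only for the active set. This is an illustrative assumption about the layout, not the reference implementation; the function name, the text format, and the example tools are hypothetical.

```python
import json
from typing import Dict, Iterable


def build_tool_context(
    summaries: Dict[str, str],
    full_schemas: Dict[str, dict],
    active: Iterable[str],
) -> str:
    """Phase 1: every tool's summary. Phase 2: full schemas for active tools only."""
    lines = ["Available tools (summaries):"]
    # Sorted so the summary block is byte-identical from turn to turn.
    lines += [f"- {name}: {text}" for name, text in sorted(summaries.items())]
    lines.append("Full schemas for this turn:")
    for name in sorted(set(active)):
        lines.append(json.dumps({"name": name, **full_schemas[name]}, sort_keys=True))
    return "\n".join(lines)


# Hypothetical two-tool catalog.
summaries = {
    "read_file": "Read a file from the workspace.",
    "create_issue": "Open a new issue in a repository.",
}
full_schemas = {
    "read_file": {"parameters": {"path": {"type": "string"}}},
    "create_issue": {"parameters": {"title": {"type": "string"}}},
}
context = build_tool_context(summaries, full_schemas, active=["read_file"])
```

Keeping the summary block first and in a fixed order is a design choice that lets it sit in a stable prompt prefix, while the per-turn schemas vary after it.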
The paper also reports a hallucination rejection gate: if the model tries to call a tool that wasn't promoted this turn (because it only saw the summary, not the full schema), the middleware rejects the call and returns a structured error. This triggers on 2.3% of turns; in 78% of those cases the model recovers on the next turn by selecting an available tool (Sadani and Kumar, 2026).
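A rejection gate of this kind can be very small. The sketch below assumes the middleware knows which tools were promoted this turn; the function name and the error fields are hypothetical, chosen only to show what a structured error might carry.

```python
from typing import Dict, Set


def validate_tool_call(tool_name: str, promoted: Set[str]) -> Dict[str, object]:
    """Accept calls to promoted tools; reject everything else with a structured error."""
    if tool_name in promoted:
        return {"ok": True, "tool": tool_name}
    return {
        "ok": False,
        "error": "tool_not_available",
        "requested": tool_name,
        # Listing the promoted tools gives the model something to recover with.
        "available": sorted(promoted),
    }


promoted_this_turn = {"read_file", "search_issues"}
accepted = validate_tool_call("read_file", promoted_this_turn)
rejected = validate_tool_call("delete_repo", promoted_this_turn)
```

Returning the list of available tools alongside the error is one plausible way to support the next-turn recovery the paper reports.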
Effective Context Utilization: Before and After Tool Attention
The fraction of the context window available for task-relevant content (user messages, reasoning, tool outputs). Tool Attention raises utilization from 24% to 91% by removing schema bloat.
| Series | Context available for task content (%) |
|---|---|
| Full Schema (baseline) | 24 |
| ~70% cliff (reasoning degrades above this) | 30 |
| Tool Attention (this work) | 91 |
Why This Changes Things
The most important framing in the paper isn't about tokens. It's about reasoning.
There's a known phenomenon in LLM deployments — the authors cite multiple empirical studies — where reasoning quality degrades sharply once context utilization exceeds roughly 70%. Below that threshold, models perform normally. Above it, they start hallucinating tool parameters, confusing similar-sounding functions, and losing track of multi-step plans. The authors call this "mid-session drift": the agent's behavior degrades not from any single catastrophic error but from a slow erosion of its usable reasoning surface (Sadani and Kumar, 2026).
In a naive MCP deployment with 120 tools, the tool schemas alone consume enough context to push utilization past that cliff before the agent has done anything. The task content, conversation history, and intermediate results — the things that actually matter — are fighting for what's left. Tool Attention inverts this ratio.
There's also a cost dimension that the paper contextualizes dramatically. The authors cite a published benchmark comparing CLI-equivalent workflows (where tool discovery is handled efficiently) against MCP workflows for the same 10,000 operations: $3.20 versus $55.20 — a 17× difference, almost entirely attributable to schema token inflation (Sadani and Kumar, 2026, citing community benchmarks). If those numbers hold in production, the economic case for something like Tool Attention isn't marginal — it's transformative for any organization running agents at scale.
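The 17× figure is simple division over the two quoted totals. The check below reproduces it and adds the implied per-operation cost; the variable names are illustrative.

```python
# Totals quoted in the article for the same 10,000 operations.
OPERATIONS = 10_000
CLI_COST_USD = 3.20
MCP_COST_USD = 55.20

ratio = MCP_COST_USD / CLI_COST_USD
cli_per_op = CLI_COST_USD / OPERATIONS
mcp_per_op = MCP_COST_USD / OPERATIONS

print(f"MCP / CLI cost ratio: {ratio:.2f}x")
print(f"per operation: CLI ${cli_per_op:.5f} vs MCP ${mcp_per_op:.5f}")
```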
Then there's the security angle, which is the paper's most interesting secondary contribution. Tool Poisoning Attacks (TPAs) are a real and documented threat: an adversary who controls even one tool's description can inject hidden instructions into that description, which the LLM's attention mechanism then processes alongside the legitimate content. The attack doesn't require the poisoned tool to be called — just present in the prompt. Tool Attention's gating mechanism has a natural defensive property here. A poisoned tool whose description doesn't semantically match the current query gets gated out entirely. It never touches the model's attention layers. The attack surface shrinks proportionally with the active set size.
The paper grounds this in a theoretical framework from the MCP security literature called Total Attention Energy (TAE) — a measure of how much causal influence a context token has over the model's output decisions, summed across all attention heads and layers. The key insight: TAE cannot be high for a schema that isn't in the prompt. The ISO score serves as a cheap, embedding-space proxy for expected TAE. Tools with low expected TAE can be excluded safely.
Methodologically, Tool Attention is positioned as middleware — it slots into the before_model hook of frameworks like LangGraph and runs entirely between the user's message and the model's API call. It doesn't require changes to the MCP protocol, doesn't require retraining the LLM, and doesn't require tool authors to modify their servers. The reference implementation uses commodity components: FAISS for vector indexing, sentence-transformers for embeddings, and tiktoken for token counting. The computational overhead of the routing step itself is sub-millisecond for catalogs up to 10,000 tools on commodity hardware.
One practical detail worth highlighting: the Phase-1 summary pool is stable across turns (it only changes when the tool catalog changes), which means it sits in the stable prefix of the prompt and earns full prompt-cache credit from providers like Anthropic. The authors report an 84% cache hit rate across a 30-turn session with Tool Attention, versus 22% with naive full-schema injection. This compounds the savings: not only are fewer tokens sent, but a higher fraction of those tokens are served from cache at reduced cost.
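To see how the two effects compound, consider a rough model in which cached tokens are billed at a fraction of the normal input price. Everything here is an illustrative assumption rather than a figure from the paper: the 0.1 price ratio is a placeholder, and applying the session-level hit rates uniformly to the tool tokens is a simplification.

```python
def effective_tokens(tokens: float, hit_rate: float, cached_price_ratio: float = 0.1) -> float:
    """Full-price-equivalent tokens when a `hit_rate` share is served from cache.

    `cached_price_ratio` is an assumed discount, not a quoted provider price.
    """
    return tokens * (hit_rate * cached_price_ratio + (1.0 - hit_rate))


# Token counts and hit rates as quoted in the article.
baseline = effective_tokens(47_300, hit_rate=0.22)
tool_attention = effective_tokens(2_400, hit_rate=0.84)

print(f"baseline: {baseline:,.0f} full-price-equivalent tokens per turn")
print(f"Tool Attention: {tool_attention:,.0f} full-price-equivalent tokens per turn")
```

Under these assumptions the gap in billed cost is wider than the gap in raw tokens, because the smaller prompt is also the one that caches better.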
What's Next
The paper is careful about what it hasn't proven. The core token-reduction numbers are directly measured and reproducible. But the projected claims — that task success rates improve by roughly 12 percentage points, that P50 latency drops by 41%, that marginal cost falls by 82% — are extrapolations from token counts combined with published deployment telemetry, not from live agent evaluations. Sadani and Kumar flag this consistently throughout, which is commendable. The next obvious step is exactly what the paper doesn't yet provide: an end-to-end evaluation on real LLM agents running real multi-step tasks, measured against ground truth.
There are also open questions about routing failures. The ISO threshold is calibrated on a held-out set of query–tool pairs, typically landing between 0.22 and 0.32. But tool descriptions are written by humans, and humans are inconsistent. The authors address this partially with a summarize_tool.py utility that uses an LLM to regenerate summaries in a "user intent voice" — and report that this reduces average summary length by 63% while improving retrieval recall by 88 points — but the quality of tool summaries remains a deployment-specific variable that will require ongoing maintenance.
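The article does not say how the threshold is selected on the held-out set, so the following is only one plausible procedure: sweep candidate thresholds and keep the one with the best F1 on labeled query–tool pairs. The function name, the sample scores, and the choice of F1 as the criterion are all assumptions.

```python
from typing import List, Sequence, Tuple


def best_threshold(pairs: Sequence[Tuple[float, bool]], candidates: List[float]) -> float:
    """Return the candidate threshold with the highest F1 on (score, is_relevant) pairs."""

    def f1(threshold: float) -> float:
        tp = sum(1 for score, rel in pairs if score >= threshold and rel)
        fp = sum(1 for score, rel in pairs if score >= threshold and not rel)
        fn = sum(1 for score, rel in pairs if score < threshold and rel)
        if tp == 0:
            return 0.0
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    # max() keeps the first candidate among ties, i.e. the lowest such threshold.
    return max(candidates, key=f1)


# Hypothetical held-out pairs: (ISO score, whether the tool was actually relevant).
held_out = [(0.45, True), (0.35, True), (0.30, True),
            (0.25, False), (0.20, False), (0.10, False)]
candidates = [0.22, 0.24, 0.26, 0.28, 0.30, 0.32]
chosen = best_threshold(held_out, candidates)
```

A deployment that cares more about never missing a relevant tool could swap F1 for a recall-weighted criterion.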
The paper also acknowledges two complementary research directions. One is Anthropic's code-execution pattern, which shifts agents from a "reason-call-reason" loop to a single orchestrated script that filters and aggregates tool outputs inside a sandbox, achieving up to 98.7% token reduction on data-heavy output workflows (Sadani and Kumar, 2026). That technique optimizes what comes back from tools; Tool Attention optimizes what goes in. A combined system handles both ends of the context-engineering stack.
The other is the emerging MCP over QUIC transport layer, which proposes native subscription and edge caching at the protocol level. If that specification matures and is widely adopted, it would address parts of the Tools Tax at a lower level than middleware — and potentially make Tool Attention's approach unnecessary for the schema-injection problem specifically. The authors acknowledge this gracefully: Tool Attention is deployable today, and can be retired cleanly when transport-layer solutions arrive.
The deeper thesis — that protocol-level efficiency, not raw context length, is the binding constraint on scalable agentic systems — feels like the paper's most durable contribution. The industry has spent enormous energy extending context windows: from 4,000 tokens to 32,000, then 128,000, now 1,000,000 and beyond. But if a 120-tool deployment consumes 47,000 tokens per turn on schema overhead alone, a million-token context window doesn't solve the problem — it just delays hitting the wall. The efficiency of what's inside the window matters as much as the window's size.
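A quick calculation makes the point concrete. It uses the 47,000-token figure and the window sizes named above; the 30-turn session length is borrowed from the cache experiment mentioned earlier and is otherwise an arbitrary choice.

```python
SCHEMA_TOKENS_PER_TURN = 47_000
WINDOWS = [4_000, 32_000, 128_000, 1_000_000]
TURNS = 30  # assumed session length, for illustration


def share_of_window(schema_tokens: int, window: int) -> float:
    """Fraction of a context window consumed by schema tokens in a single turn."""
    return schema_tokens / window


def cumulative_schema_tokens(schema_tokens: int, turns: int) -> int:
    """Schema tokens sent over a session when the catalog is re-sent every turn."""
    return schema_tokens * turns


for window in WINDOWS:
    share = share_of_window(SCHEMA_TOKENS_PER_TURN, window)
    note = "does not fit" if share > 1 else f"{share:.1%} of the window"
    print(f"{window:>9,} tokens: {note}")

total = cumulative_schema_tokens(SCHEMA_TOKENS_PER_TURN, TURNS)
print(f"schema tokens sent over {TURNS} turns: {total:,}")
```

Even where the per-turn share looks small, the schema is paid for again on every turn, so the session total grows linearly with conversation length.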
That reframe points toward a broader research agenda. How do we make the context window not just larger but smarter about what it carries? Tool Attention is one answer for one category of content. The same logic — maintain a compact, always-resident summary; promote full detail only when needed; gate on relevance; reject hallucinations deterministically — could apply to memory systems, retrieved documents, conversation history, and intermediate reasoning traces. The paper is, in that sense, less a final answer than a template for thinking about context engineering as a first-class systems problem.
For anyone building AI agents at scale today, the paper offers something rarer than a theoretical framework: a concrete, deployable solution to a concrete, expensive problem, with the code already on GitHub.
Protocol-level efficiency, not raw context length, is a binding constraint on scalable agentic systems.