856 Tools, Zero Bloat: How We Make Massive MCP Work Without Destroying Context
Every AI platform eventually hits the same wall: tools.
You start with 10. Reading files, writing files, running commands. Fits in context easily. Then you add code search. Then deployment. Then security audits, memory queries, image generation, email, social media, cloud GPU provisioning, training pipelines, 29 agent personas, and a knowledge graph. Suddenly you have 856 tools across 94 modules and 15 domains — and every tool schema you inject into the LLM prompt costs ~400 tokens.
856 tools x 400 tokens = 342,400 tokens of tool schemas alone. That's your entire context window. For tools the model probably won't use.
This is the scaling wall nobody talks about. OpenAI's agent SDK sends all tools every turn. MCP servers expose flat catalogs. Everyone assumes the LLM will figure out what's relevant from a list of 200 functions. It doesn't. It hallucinates tool names. It picks the wrong one. It spends reasoning tokens scanning irrelevant schemas instead of thinking about your actual problem.
We solved this. Here's how.
The Full Inventory: What 856 Tools Actually Looks Like
Before the solution, the problem. AitherOS runs 196 microservices across 12 architectural layers. AitherNode (apps/AitherNode/mcp_server.py) is the single MCP gateway — it auto-discovers every public function with a docstring and type hints from 94 mcp_*.py modules. No manual registration. No schema files.
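Under the hood, that kind of auto-discovery can be as simple as walking the package with inspect. The helper below is a minimal sketch of the idea — the package layout, function name, and schema shape are illustrative, not the actual AitherNode code:

import importlib
import inspect
import pkgutil
from typing import get_type_hints

def discover_tools(package_name: str) -> dict:
    """Collect every public, documented, type-hinted function from mcp_* modules."""
    tools = {}
    package = importlib.import_module(package_name)
    for mod_info in pkgutil.iter_modules(package.__path__):
        if not mod_info.name.startswith("mcp_"):
            continue
        module = importlib.import_module(f"{package_name}.{mod_info.name}")
        for name, fn in inspect.getmembers(module, inspect.isfunction):
            if name.startswith("_") or not fn.__doc__:
                continue                      # skip private or undocumented functions
            hints = get_type_hints(fn)
            if not hints:
                continue                      # no type hints, no schema
            doc = inspect.getdoc(fn) or ""
            tools[f"{mod_info.name}.{name}"] = {
                "description": doc.splitlines()[0] if doc else "",
                "parameters": {p: str(t) for p, t in hints.items() if p != "return"},
            }
    return tools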
Here's what the catalog actually contains:
Infrastructure & Deployment (~105 tools, 6 modules)
mcp_infrastructure (41 tools) is the largest single module — Docker lifecycle, VM provisioning, HyperV templates, network inspection, service health checks. mcp_deploy and mcp_cloud_deploy handle push-to-production across local Docker, cloud VMs, and edge nodes. mcp_cloud_training manages GPU rentals on Vast.ai. mcp_chaos runs chaos engineering campaigns against your own infrastructure.
Code Intelligence (~80 tools, 8 modules)
mcp_codegraph and mcp_codegraph_registry expose AST-level analysis — callers, callees, dependency chains, symbol search across 62,867 indexed chunks. mcp_repowise adds 4-layer intelligence: graph (dependency DAG), git (hotspots, ownership, co-change), docs (LLM-generated wiki per module), and decisions (architectural decision archaeology). mcp_acc provides dimensional analysis — every symbol scored on Stability, Friction, Logic, and Autonomy axes. mcp_flow (34 tools) covers the full development workflow from analysis to refactoring to code review.
Agent Orchestration (~60 tools, 6 modules)
mcp_orchestrator talks directly to Genesis (port 8001) — the system brain. mcp_forge spawns subagents with full ReAct loops, tool access, and sandbox execution. mcp_swarm runs an 11-agent coding swarm: architect designs, 8 agents execute in parallel (3 coders, 2 testers, 2 security, 1 scribe), reviewer consolidates, judge accepts or rejects. mcp_agent_delegate lets you ask any of 29 named agents a question and get structured JSON back with file paths, code refs, and suggestions. mcp_fleet manages multi-agent fleets. mcp_reasoning offloads deep thinking to cloud GPUs.
Memory & Context (~60 tools, 6 modules)
mcp_memory handles semantic long-term memory with embeddings. mcp_context and mcp_context_synthesis manage the 12-stage context assembly pipeline — the system that decides what the LLM knows about itself, the user, the conversation, and the world on every turn. mcp_strata reads and writes to the observability database. mcp_eventgraph queries the full event history. mcp_sttp (Spatio-Temporal Transfer Protocol) persists cognitive state across sessions.
Generation & Media (~74 tools, 10 modules)
Image generation via ComfyUI, video processing, voice synthesis, emotion detection, vision analysis, OCR, screen parsing via OmniParser, 3D character pipelines, human figure generation, and multimodal tool orchestration. Ten modules that cover every modality.
Storage & Files (~50 tools, 4 modules)
Beyond basic read/write: RAG ingestion, document parsing, virtual filesystem paths, cross-service file routing, and a full storage layer with versioning.
Search & Web (~52 tools, 5 modules)
Web search, browser context management (headless Chrome automation), screen parsing, external API integrations (20 tools in mcp_external alone), and code exploration.
Communication (~32 tools, 3 modules)
Email composition and sending, social media posting (Bluesky), and mcp_relay — a 21-tool IRC-meets-AI chat relay system.
LLM & Models (~51 tools, 5 modules)
Ollama management, model fitting to hardware (LLMFit), learning loop management, LLM trace analysis, and model stack configuration for multi-model vLLM routing.
Training & Data (~57 tools, 3 modules)
Training data collection, session harvesting from 5 platforms (Claude, Codex, Gemini, ChatGPT, Copilot), world model management, and fine-tuning orchestration.
Security & Access (~48 tools, 5 modules)
RBAC management (20 tools), secrets vault operations, security campaign execution, permission auditing, and account lifecycle management.
System & Diagnostics (~49 tools, 5 modules)
Context X-ray (pipeline debugging), system fingerprinting, network inspection, command execution, and tool trace analysis.
Services & Gateway (~51 tools, 6 modules)
Service mesh management, MCP SaaS gateway operations, companion mode (personality overlays), persona management, and managed service lifecycle.
Content & Documentation (~65 tools, 5 modules)
Blog publishing, notebook management (37 tools for Jupyter-style notebooks), Neocities static site deployment, roadmap tracking, and document generation.
Specialized (~113 tools, 17 modules)
Business operations, employment/HR workflows, intent planning, OSS tracking, release management, repository onboarding, code review workflows, soul/personality loading, spec management, strategy planning, tenant onboarding, consumer protection analysis (Themis), and Vast.ai GPU host management (31 tools across two modules).
Why You Can't Just Send All of Them
The math is brutal. A 20-turn agent conversation where every turn includes all tool schemas:
856 tools x ~400 tokens/schema = 342,400 tokens (tools alone)
x 20 turns = 6,848,000 tokens per session
The math doesn't leave room for turns: 342,400 tokens of schemas is more than two full 128K context windows, and more than ten 32K windows, before the model has read a single user message.
But the real problem isn't tokens — it's attention. LLMs don't linearly scan tool lists. They attend to them. 856 tools means 856 candidates competing for attention on every function-calling decision. The model wastes reasoning cycles on vastai_list_instances when the user asked a question about memory. It hallucinates hybrid tool names. It picks mcp_filesystem.read_file when mcp_codegraph.search would have been right.
More tools in context = worse tool selection. This is the fundamental tension.
The Solution: ToolGraph — A 3-Tier Selection System
lib/core/ToolGraph.py (1,119 lines) is the answer. Instead of sending all tools, it predicts which 8 tools the LLM will actually need for each turn. Three tiers, ordered from the fastest and most learned down to a guaranteed fallback.
Tier 1: NanoGPT Predictor (<1ms)
A character-level NanoGPT model trained on actual tool-usage sessions. Not categories — individual tools. It takes the task description, evaluates each tool's relevance via loss scoring, and returns the top candidates.
Architecture: 1-layer transformer, 32-dim embeddings, 4 attention heads, block size 32. Tiny. Fast. Per-category LoRA adapters handle domain-specific fine-tuning. Retraining triggers automatically once 50 labeled sessions accumulate.
When it's confident (top-K coverage >= 50%), it returns the selection directly. Sub-millisecond. No embeddings, no HTTP calls, no search index.
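Here's roughly what that Tier 1 gate looks like. A minimal sketch: the function name and the way coverage is computed are assumptions, not the ToolGraph source.

def tier1_select(score_tools, task: str, top_k: int = 8) -> list[str] | None:
    """Return a selection only if the predictor's top-K coverage clears 50%."""
    scored = sorted(score_tools(task), key=lambda pair: pair[1], reverse=True)
    if not scored:
        return None
    total = sum(score for _, score in scored)
    top = scored[:top_k]
    coverage = (sum(score for _, score in top) / total) if total else 0.0
    if coverage >= 0.5:
        return [name for name, _ in top]   # confident: no embeddings, no HTTP, no index
    return None                            # not confident: fall through to Tier 2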
Tier 2: Hybrid Search (~10-150ms)
When NanoGPT isn't trained yet or isn't confident:
Keyword search runs TF-IDF scoring over tool names, descriptions, and parameters. IDF weights calculated per tool based on document frequency across the full registry. Normalized to [0,1].
Semantic search encodes the task prompt via the embedding engine (768-dim vectors) and compares against pre-computed tool embeddings. Falls back gracefully if the embedding engine is unavailable.
Score merging: 40% keyword, 60% semantic. Then two boost layers:
- Category boost (3x): ToolNanoGPT predicts top 3 categories, tools in matching categories get tripled scores
- Intent boost (1.5x): IntentToolMap's learned routing biases tools that succeeded for this intent type before
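Put together, the merge-and-boost step is a weighted sum plus two multipliers. A simplified sketch (names and data shapes are illustrative):

def merge_and_boost(
    keyword: dict[str, float],        # normalized TF-IDF scores in [0, 1]
    semantic: dict[str, float],       # cosine similarity to the task embedding
    tool_category: dict[str, str],    # tool name -> category
    boosted_categories: set[str],     # ToolNanoGPT's top-3 predicted categories
    intent_tools: set[str],           # tools that previously succeeded for this intent
) -> dict[str, float]:
    scores = {}
    for tool in keyword.keys() | semantic.keys():
        score = 0.4 * keyword.get(tool, 0.0) + 0.6 * semantic.get(tool, 0.0)
        if tool_category.get(tool) in boosted_categories:
            score *= 3.0              # category boost
        if tool in intent_tools:
            score *= 1.5              # intent boost from learned routing
        scores[tool] = score
    return scores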
Tier 3: Full Fallback
If tiers 1 and 2 produce nothing — new system, no history, embeddings down — return all tools. Graceful degradation. The system never fails to provide tools; it just gets smarter about which ones over time.
Category Penalties: Domain-Aware Suppression
Even within hybrid search, ToolGraph suppresses obviously irrelevant domains:
_CATEGORY_GATE_TERMS = {
    "ui": {"ui", "visual", "screenshot", "element"},
    "perception": {"image", "audio", "voice", "transcribe"},
    "generation": {"generate", "image", "comfy"},
    "consumer": {"form", "contract", "price", "dark pattern"},
}
If your task doesn't mention any gate terms for a category, tools in that category get a 10x penalty. "Write a unit test for the auth middleware" won't surface image generation tools.
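The penalty itself is a single multiplier applied during scoring. Roughly like this — a sketch, not the actual implementation:

def apply_category_penalty(
    scores: dict[str, float],
    task: str,
    tool_category: dict[str, str],
    gate_terms: dict[str, set[str]],   # e.g. _CATEGORY_GATE_TERMS above
    penalty: float = 0.1,
) -> dict[str, float]:
    """Suppress tools in categories whose gate terms never appear in the task."""
    task_lower = task.lower()
    out = {}
    for tool, score in scores.items():
        terms = gate_terms.get(tool_category.get(tool, ""), set())
        if terms and not any(term in task_lower for term in terms):
            score *= penalty           # 10x suppression for an irrelevant domain
        out[tool] = score
    return out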
Essential Tools: The Safety Net
Seven tools are force-included regardless of ToolGraph predictions:
_ESSENTIAL_TOOLS = {
    "read_file", "write_file", "replace_in_file",
    "search_files", "list_directory", "code_search", "reason",
}
These are the tools the model is trained to reach for on almost every task. Removing them causes cascading failures — the model tries to call read_file, gets "unknown tool", and derails. Cheap insurance.
The Escape Hatch: tool_search
What if the 8 preselected tools don't include what the LLM needs? The tool_search meta-tool.
Added automatically when effort >= 3 (anything beyond trivial one-shot answers), tool_search lets the LLM query the full registry mid-turn:
- LLM receives 8 preselected tools + tool_search
- Realizes it needs image generation (not in the preselected set)
- Calls tool_search("generate image")
- Gets back 5 matching tools with names and descriptions
- Calls generate_image(prompt="...") directly
The discovered tool gets auto-promoted into the active set. If the LLM calls it again later in the turn, it's already registered — no re-validation needed.
if not tool_def and self._all_tools is not self._tools:
    # Tool was discovered via tool_search: promote it from the full
    # registry into the active set for the rest of the turn.
    full_def = self._all_tools.get(call.tool_name)
    if full_def:
        self._tools.register(call.tool_name, ...)
This is the critical insight: you don't need all tools in context to have access to all tools. You need a good predictor and a discovery mechanism. The LLM gets the tools it probably needs upfront, and a search function for everything else.
The Learning Loop: How It Gets Smarter
ToolGraph doesn't just select — it records outcomes and improves.
ToolNanoGPT Training Pipeline
lib/cognitive/ToolNanoGPT.py (505 lines) manages the self-learning loop:
Recording: Every tool invocation writes to Library/Training/tool_usage/sessions.jsonl:
{"task": "review the auth middleware", "tools_used": ["codegraph_search", "read_file", "reason"], "timestamp": "..."}
Feedback classification: When predictions meet reality, three outcomes:
- CONFIRMED: Predicted AND used AND succeeded. Strong positive signal.
- REJECTED: Predicted but NOT used. Negative signal — stop predicting this.
- DISCOVERED: NOT predicted but WAS used (via tool_search). New positive signal — start predicting this.
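In code, the classification is a straightforward set comparison. A minimal sketch — the enum and function names are assumptions:

from enum import Enum

class Feedback(Enum):
    CONFIRMED = "confirmed"      # predicted AND used AND succeeded
    REJECTED = "rejected"        # predicted but never used
    DISCOVERED = "discovered"    # used via tool_search despite not being predicted

def classify_feedback(predicted: set[str], used: set[str], succeeded: set[str]) -> dict[str, Feedback]:
    labels: dict[str, Feedback] = {}
    for tool in predicted | used:
        if tool in predicted and tool in used and tool in succeeded:
            labels[tool] = Feedback.CONFIRMED
        elif tool in predicted and tool not in used:
            labels[tool] = Feedback.REJECTED
        elif tool not in predicted and tool in used:
            labels[tool] = Feedback.DISCOVERED
    return labels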
Retraining: JarvisBrain (the 30-second awareness tick) calls train_if_ready() every cycle. If 30 minutes have elapsed and sufficient data accumulated (50+ base records or 20+ per-category), it retrains the base model and per-category LoRA adapters.
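The gate itself is simple. A sketch of how such a check could look, assuming the interval and thresholds above:

import time

RETRAIN_INTERVAL_S = 30 * 60        # retrain at most every 30 minutes
MIN_BASE_RECORDS = 50               # base model threshold
MIN_CATEGORY_RECORDS = 20           # per-category LoRA threshold

def should_retrain(last_trained_at: float, base_records: int, per_category: dict[str, int]) -> bool:
    if time.time() - last_trained_at < RETRAIN_INTERVAL_S:
        return False                # too soon since the last pass
    if base_records >= MIN_BASE_RECORDS:
        return True                 # enough data to refresh the base model
    return any(n >= MIN_CATEGORY_RECORDS for n in per_category.values())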
IntentToolMap: Learned Routing
lib/cognitive/IntentToolMap.py maps intent types to tools based on historical success:
Seed mappings bootstrap the system:
"question" -> ["web_search", "query_memory", "codegraph_search", "rag_query"]
"code" -> ["explore_code", "codegraph_get_context", "fs_write_file", "run_terminal_command"]
"security" -> ["security_audit", "rbac_check", "capability_verify"]
Learned routing refines these based on actual outcomes. Each intent-tool pair accumulates a score: a success adds 1.0, a failure subtracts 0.3, and a per-record decay of 0.995 lets old mappings fade as usage patterns evolve.
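A sketch of that update rule — the function and constant names are assumptions, but the deltas and decay match the numbers above:

DECAY = 0.995            # applied to every stored score on each new record
SUCCESS_DELTA = 1.0
FAILURE_DELTA = -0.3

def record_outcome(scores: dict[tuple[str, str], float], intent: str, tool: str, success: bool) -> None:
    """Decay all existing intent->tool scores, then reward or penalize this pair."""
    for key in scores:
        scores[key] *= DECAY
    pair = (intent, tool)
    scores[pair] = scores.get(pair, 0.0) + (SUCCESS_DELTA if success else FAILURE_DELTA)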
The result: after a few dozen sessions, the system knows that "deploy" intents need cloud_deploy, git_push, and infrastructure_status — not because we hardcoded it, but because that's what actually worked.
Effort-Based Context Scaling
ToolGraph is only one layer. The EffortScaler (lib/faculties/EffortScaler.py) controls how much context the LLM gets at every level — tools, system prompt, memory, and reasoning depth.
Effort 1-2: Trivial
- Context depth: axioms only (~500 tokens)
- Orchestration: direct LLM call, no tools
- Reasoning: skip
- Total overhead: ~900 tokens
"What time is it?" doesn't need 856 tools. It doesn't need tools at all.
Effort 3-4: Simple
- Context depth: axioms + working memory (~1,200 tokens)
- Orchestration: tools enabled, 8 preselected
- Reasoning: gate check (maybe skip)
- Total overhead: ~4,400 tokens
Effort 5-6: Standard
- Context depth: full pipeline — axioms, memory, persona, affect (~4,000 tokens)
- Tools: 8-10 preselected + tool_search + force-includes
- Reasoning: full SASE pipeline
- Total overhead: ~9,100 tokens
Effort 7-10: Critical
- Context depth: all layers — knowledge graph, code graph, decision history (~10,000+ tokens)
- Tools: 12-14 preselected + tool_search
- Reasoning: SASE + verification
- Total overhead: ~19,000+ tokens
The key: effort level is classified automatically. IntentEngine analyzes the user message, detects complexity signals, and EffortScaler maps that to an execution plan. "What's 2+2" gets 900 tokens of overhead. "Redesign the authentication layer to support SAML SSO" gets 19,000.
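A simplified sketch of that effort-to-plan mapping, using the depth and orchestration modes from the diagram further down. The dataclass fields are illustrative, not the EffortScaler API:

from dataclasses import dataclass

@dataclass
class ExecutionPlan:
    context_depth: str       # "axioms" | "fast" | "full" | "full_graph"
    orchestration: str       # "direct" | "tools" | "pipeline"
    preselect_tools: int     # how many tools ToolGraph should return
    add_tool_search: bool
    run_reasoning: bool

def scale(effort: int) -> ExecutionPlan:
    if effort <= 2:                                   # trivial: no tools at all
        return ExecutionPlan("axioms", "direct", 0, False, False)
    if effort <= 4:                                   # simple: 8 preselected, reasoning gated
        return ExecutionPlan("fast", "tools", 8, True, False)
    if effort <= 6:                                   # standard: full pipeline
        return ExecutionPlan("full", "pipeline", 10, True, True)
    return ExecutionPlan("full_graph", "pipeline", 14, True, True)   # critical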
Trained Model Optimization
When the system detects model_is_trained=True (the fine-tuned aither-orchestrator model), it skips injecting soul prompts, static rules, system knowledge, and response format instructions. The model has internalized those during training. Savings: ~2,800 tokens per turn.
Permission-Based Filtering: Three More Layers
Even after ToolGraph selects 8 tools, three more filtering layers ensure only appropriate tools reach the LLM.
Runtime Mode
AitherNode detects its connectivity state and filters accordingly:
- LOCAL (Genesis reachable): All 856 tools available
- REMOTE (Cloud gateway): Tenant-scoped subset via MCP SaaS gateway
- STANDALONE (Offline): 35 modules blocked (mcp_memory, mcp_training, mcp_orchestrator, mcp_codegraph, mcp_swarm, etc.), leaving ~500 local-only tools
CallerContext Permissions
Every tool call is checked against 6 permission flags derived from the caller's identity:
- can_communicate — gates email, social, relay tools
- can_mutate — gates file writes, git commits, RAG ingestion
- can_execute — gates terminal commands, workflow execution
- can_forge — gates agent spawning, swarm orchestration
- can_generate_image / can_generate_docs — gates media creation
- PLATFORM_ONLY — gates secrets, LLM traces, admin ops
A public demo user (CallerType.DEMO) gets read-only access to ~200 tools. A platform admin (CallerType.PLATFORM) gets all 856.
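A sketch of how flag-based filtering can work. The CallerContext fields match the list above, but the flag-to-tool mapping here is hypothetical:

from dataclasses import dataclass

@dataclass
class CallerContext:
    can_communicate: bool = False
    can_mutate: bool = False
    can_execute: bool = False
    can_forge: bool = False
    can_generate_image: bool = False
    can_generate_docs: bool = False

# Hypothetical flag -> gated-tools mapping; the real table lives in the permission layer.
GATED_TOOLS = {
    "can_communicate": {"email_send", "social_post", "relay_send"},
    "can_mutate": {"fs_write_file", "git_commit", "rag_ingest"},
    "can_execute": {"run_terminal_command", "workflow_execute"},
    "can_forge": {"forge_spawn_agent", "swarm_run"},
}

def filter_by_permissions(tools: set[str], caller: CallerContext) -> set[str]:
    allowed = set(tools)
    for flag, gated in GATED_TOOLS.items():
        if not getattr(caller, flag):
            allowed -= gated         # caller lacks the flag: drop everything it gates
    return allowed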
Capability Tokens
Per-agent HMAC-SHA256 signed tokens from capabilities.yaml define exactly which MCP tools each agent can invoke. Demiurge (the coder) gets codegraph_search, repowise_get_answer, fs_write_file. It doesn't get security_audit or email_send. Athena (security) gets the inverse.
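Verification is standard HMAC. A minimal sketch, assuming the signed payload is just the agent name plus its sorted tool list (the real capabilities.yaml format may differ):

import hashlib
import hmac

def sign_capability(secret: bytes, agent: str, tools: list[str]) -> str:
    """Mint an HMAC-SHA256 signature over an agent's allowed tool list."""
    payload = f"{agent}:{','.join(sorted(tools))}".encode()
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()

def call_is_allowed(secret: bytes, agent: str, tools: list[str], signature: str, tool_name: str) -> bool:
    """A tool call passes only if the token verifies and the tool is on the agent's list."""
    expected = sign_capability(secret, agent, tools)
    return hmac.compare_digest(expected, signature) and tool_name in tools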
The Numbers
Before ToolGraph:
30 tools x 400 tokens = 12,000 tokens per turn (tools only)
20 turns = 240,000 tokens per session
After ToolGraph:
8 tools x 400 tokens = 3,200 tokens per turn
20 turns = 64,000 tokens per session
Savings: 176,000 tokens (73%)
With effort scaling (mix of trivial and complex turns):
Average overhead: ~4,500 tokens per turn
20 turns = 90,000 tokens total (including system prompt, context, tools)
vs. 500,000+ without any filtering
And it gets better over time. ToolNanoGPT's predictions improve with usage. IntentToolMap's learned routing sharpens. Category penalties suppress noise more precisely. The 73% savings is the floor, not the ceiling.
The Architecture Diagram
User Message
|
v
IntentEngine.classify()
|-- Intent type (24 categories)
|-- Effort level (1-10)
'-- Tool hints
|
v
EffortScaler.scale()
|-- context_depth: axioms | fast | full | full_graph
|-- orchestration_mode: direct | tools | pipeline
|-- tool_categories: [predicted domains]
'-- recommended_model: orchestrator | reasoning
|
v
ToolGraph.select_for_task(top_k=8)
|-- Tier 1: NanoGPT (<1ms, if trained)
|-- Tier 2: Hybrid keyword+semantic (~150ms)
|-- Category boost (3x from ToolNanoGPT)
|-- Intent boost (1.5x from IntentToolMap)
'-- Category penalties (0.1x for irrelevant domains)
|
v
AgentRuntime._preselect_tools()
|-- Force-include essentials (7 tools)
|-- Add tool_search (if effort >= 3)
|-- Filter by capability tokens
'-- Filter by caller permissions
|
v
LLM receives: 8-12 tool schemas (~3,200-4,800 tokens)
|
v
[If LLM needs unlisted tool]
|-- Calls tool_search("query")
|-- Gets 5 candidates back
|-- Calls discovered tool directly
'-- Auto-promotion registers it for remaining turns
|
v
Usage recorded -> ToolNanoGPT retrains -> Better predictions next time
Why This Matters Beyond AitherOS
The MCP ecosystem is growing fast. Every IDE, every agent framework, every SaaS product is adding MCP tool support. The current approach — expose a flat list of tools and pray — works with 20 tools. It breaks at 100. It's unusable at 500.
The solution isn't "fewer tools." The solution is intelligent selection. A system that:
- Predicts which tools matter for this specific task
- Discovers missed tools on demand without upfront cost
- Learns from every session to improve predictions
- Scales context proportional to task complexity
- Filters by permission, capability, and runtime mode
We've been running this in production across 94 MCP modules, 856 tools, and 15 domains. The NanoGPT predictor trains itself. The IntentToolMap refines itself. Category penalties suppress noise automatically. The system gets faster and more accurate with every conversation.
856 tools. 8 per turn. Zero bloat.
That's how you scale MCP.