What Breaks When You Swap Your LLM — And What Doesn't
The "replace the model tomorrow" test is the quickest way to tell a real agent system from a wrapper. If your agent breaks when you change from GPT-4 to Claude to a local model, you don't have an agent — you have a prompt chain with extra steps.
We ran this test on AitherOS. Here's the honest breakdown.
What Changes (By Design)
1. Context Scaffolding Intensity
Stronger models need less hand-holding. AitherOS's NeuronScaler adjusts automatically:
| Model Tier | Neuron Multiplier | Recursive Refinement |
|---|---|---|
| Frontier (Claude Opus, GPT-4o) | 0.6x | Skipped |
| Strong (Sonnet, Qwen-70B) | 0.8x | Enabled |
| Standard (Haiku, Qwen-30B) | 1.0x | Enabled |
| Weak (Qwen-4B, Phi-3) | 1.3x | Enabled |
This mapping now lives in config/model_tiers.yaml — add a new model without touching code.
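As a rough illustration of how a tier table like this could drive scaffolding, here is a minimal Python sketch. The multipliers and refinement flags come from the table above; the tier keys, the model names in `MODEL_TIERS`, the fallback to the standard tier, and the function names are all assumptions made for the example, not the actual NeuronScaler API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierSettings:
    neuron_multiplier: float
    recursive_refinement: bool


# Values copied from the table above; the tier keys are hypothetical names.
TIERS = {
    "frontier": TierSettings(0.6, False),
    "strong": TierSettings(0.8, True),
    "standard": TierSettings(1.0, True),
    "weak": TierSettings(1.3, True),
}

# Hypothetical model-to-tier mapping, standing in for config/model_tiers.yaml.
MODEL_TIERS = {
    "claude-opus": "frontier",
    "qwen-70b": "strong",
    "haiku": "standard",
    "phi-3": "weak",
}


def settings_for(model: str, default_tier: str = "standard") -> TierSettings:
    """Look up scaffolding settings; unknown models fall back to default_tier (an assumption)."""
    return TIERS[MODEL_TIERS.get(model, default_tier)]


def scaled_neuron_count(base: int, model: str) -> int:
    """Scale a base neuron budget by the model's tier multiplier, never below 1."""
    return max(1, round(base * settings_for(model).neuron_multiplier))
```

With a mapping like this, adding a model is a one-line data change, which is the property the config file is meant to provide.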
2. Tool Call Format
Different models emit tool calls differently. Our ToolCallParser handles 5 formats transparently:
- OpenAI-style JSON tool_calls array
- Anthropic XML <tool_call> tags
- Hermes/ChatML format
- YAML-in-markdown blocks
- Plaintext function signatures
The agent runtime doesn't care which format the model uses.
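To make the idea concrete, here is a minimal sketch of normalizing two of the five formats into one internal shape. It is not the real ToolCallParser: the `ToolCall` dataclass, the function name, and the assumption that a `<tool_call>` tag wraps a JSON object with `name` and `arguments` keys are all hypothetical.

```python
import json
import re
from dataclasses import dataclass
from typing import Any


@dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any]


_TAG_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_calls(raw: str) -> list[ToolCall]:
    """Return normalized tool calls from raw model output (two formats only)."""
    # Format 1: a JSON object carrying an OpenAI-style "tool_calls" array.
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        payload = None
    if isinstance(payload, dict) and isinstance(payload.get("tool_calls"), list):
        calls = []
        for item in payload["tool_calls"]:
            fn = item["function"]
            args = fn["arguments"]
            # OpenAI-style payloads encode the arguments as a JSON string.
            if isinstance(args, str):
                args = json.loads(args)
            calls.append(ToolCall(fn["name"], args))
        return calls
    # Format 2: <tool_call>...</tool_call> tags with a JSON body (assumed shape).
    calls = []
    for body in _TAG_RE.findall(raw):
        data = json.loads(body)
        calls.append(ToolCall(data["name"], data.get("arguments", {})))
    return calls
```

Everything downstream of the parser only ever sees `ToolCall` objects, so the runtime has no branch that depends on which model produced the output.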
3. Effort-Based Routing
Effort levels 1-3 route to small models, 4-6 to the orchestrator model, 7-10 to reasoning models. When you swap models, the pool changes but the routing logic doesn't.
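The routing rule is small enough to sketch directly. The effort bands below are the ones stated above; the pool labels and function names are hypothetical, and the model names in any `pools` mapping are whatever your deployment uses.

```python
def pool_for_effort(effort: int) -> str:
    """Map an effort level (1-10) to a model pool using the bands described above."""
    if not 1 <= effort <= 10:
        raise ValueError(f"effort must be between 1 and 10, got {effort}")
    if effort <= 3:
        return "small"
    if effort <= 6:
        return "orchestrator"
    return "reasoning"


def pick_model(effort: int, pools: dict[str, str]) -> str:
    """Resolve the pool to a concrete model; swapping models only changes `pools`."""
    return pools[pool_for_effort(effort)]
```

Swapping a backend means passing a different `pools` mapping; `pool_for_effort` stays untouched.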
What Doesn't Change (The Hard Part)
Safety Gates
Permission tiers (OBSERVE through HUMAN_REQUIRED) are enforced before the LLM sees the tool. The model can't bypass SafetyGates by formatting a tool call differently. Gate checks happen in ActionExecutor and AgentRuntime._execute_single_tool() — model-independent.
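A minimal sketch of what a gate check on the execution path might look like follows. Only the two tier names mentioned above are used; the intermediate tiers, the numeric values, the `gated_execute` function, and its parameters are assumptions for illustration, not the real ActionExecutor code.

```python
from enum import IntEnum
from typing import Any, Callable


class PermissionTier(IntEnum):
    OBSERVE = 0
    # Intermediate tiers exist in AitherOS but are not named in this post.
    HUMAN_REQUIRED = 100


class GateDenied(Exception):
    """Raised when a tool call is blocked before execution."""


def gated_execute(
    tool_name: str,
    args: dict[str, Any],
    *,
    required: dict[str, PermissionTier],
    granted: PermissionTier,
    registry: dict[str, Callable[..., Any]],
    human_approved: bool = False,
) -> Any:
    # Tools with no declared tier get the strictest one (default-deny assumption).
    need = required.get(tool_name, PermissionTier.HUMAN_REQUIRED)
    if need > granted:
        raise GateDenied(f"{tool_name} needs {need.name}, agent has {granted.name}")
    if need == PermissionTier.HUMAN_REQUIRED and not human_approved:
        raise GateDenied(f"{tool_name} needs human approval")
    # Only now does the call reach the tool, whatever format the model emitted it in.
    return registry[tool_name](**args)
```

The check takes an already-parsed tool name and arguments, so nothing about the model's output format can influence the decision.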
Capability Tokens
HMAC-SHA256 signed, per-agent. The model doesn't generate these — the kernel does. Default-deny: an agent can only use tools it has explicit capability grants for.
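For readers unfamiliar with the pattern, here is a minimal sketch of HMAC-SHA256 capability tokens using Python's standard `hmac` module. The token layout (a signature over the agent ID and tool name), the function names, and the secret handling are assumptions; the post does not describe the real token format.

```python
import hashlib
import hmac
import json
from typing import Optional


def sign_capability(secret: bytes, agent_id: str, tool: str) -> str:
    """Kernel-side: sign an (agent, tool) grant. The payload layout is an assumption."""
    # JSON-encode the pair so ("a:b", "c") and ("a", "b:c") cannot collide.
    message = json.dumps([agent_id, tool]).encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def is_allowed(secret: bytes, agent_id: str, tool: str, token: Optional[str]) -> bool:
    """Default-deny: no token, or a token that fails verification, means no access."""
    if token is None:
        return False
    expected = sign_capability(secret, agent_id, tool)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), token.encode("utf-8"))
```

Because the secret never enters the model's context, the model cannot mint a valid token no matter what text it produces.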
Memory System
The 6-tier memory pipeline (ephemeral through identity) doesn't depend on model capabilities. Outcome-based decay penalizes memories that led to failures regardless of which model was running. SUPERSEDES chains track fact evolution independent of which model updated the fact.
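Outcome-based decay can be sketched in a few lines. The multiplicative penalty, the boost, the cap, and the `Memory` shape below are illustrative assumptions; the post does not give the actual decay formula.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    score: float = 1.0


def apply_outcome(
    memories: list[Memory],
    succeeded: bool,
    *,
    penalty: float = 0.5,  # hypothetical decay factor on failure
    boost: float = 1.1,    # hypothetical reinforcement factor on success
    cap: float = 1.0,
) -> None:
    """Adjust the score of every memory that was in context for a task outcome."""
    for memory in memories:
        if succeeded:
            memory.score = min(cap, memory.score * boost)
        else:
            memory.score *= penalty
```

Note that the function takes no model identifier: the outcome alone drives the update.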
Escalation System
When an agent calls escalate_to_human, the escalation goes through EscalationDispatcher (Redis-backed, DM + email notifications). The approval/deny flow is entirely model-independent — it's a state machine, not a prompt.
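"A state machine, not a prompt" can be illustrated with a small transition table. The state names, decision strings, and function below are hypothetical; the real EscalationDispatcher also handles Redis persistence and notifications, which this sketch leaves out.

```python
from enum import Enum


class EscalationState(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    DENIED = "denied"


# The only legal moves; anything else is rejected outright.
_TRANSITIONS = {
    (EscalationState.PENDING, "approve"): EscalationState.APPROVED,
    (EscalationState.PENDING, "deny"): EscalationState.DENIED,
}


def apply_decision(state: EscalationState, decision: str) -> EscalationState:
    """Advance an escalation; terminal states and unknown decisions raise ValueError."""
    try:
        return _TRANSITIONS[(state, decision)]
    except KeyError:
        raise ValueError(f"cannot apply {decision!r} in state {state.name}") from None
```

An approved escalation cannot be re-decided, and no model output is consulted at any step.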
Error Learning
The ErrorLearningStore records tool failure patterns and recovery strategies. When tool X fails with error Y, the recovery guidance is injected into the retry context regardless of which model is reasoning about it.
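One way to picture this is a lookup keyed on the tool and error pair. The class below is a hypothetical in-memory stand-in, not the real ErrorLearningStore, and the wording of the injected context is invented for the example.

```python
from collections import defaultdict


class ErrorPatternStore:
    """In-memory stand-in: maps (tool, error) to known recovery strategies."""

    def __init__(self) -> None:
        self._guidance: dict[tuple[str, str], list[str]] = defaultdict(list)

    def record(self, tool: str, error: str, recovery: str) -> None:
        hints = self._guidance[(tool, error)]
        if recovery not in hints:  # avoid duplicate hints
            hints.append(recovery)

    def retry_context(self, tool: str, error: str) -> str:
        """Text to prepend to a retry prompt; empty if the pattern is unknown."""
        hints = self._guidance.get((tool, error), [])
        if not hints:
            return ""
        lines = [f"Known recoveries for {tool} failing with {error}:"]
        lines += [f"- {hint}" for hint in hints]
        return "\n".join(lines)
```

The same guidance string is produced whichever model is about to retry.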
Task State
ForgeSession snapshots persist to disk on every state transition. If Genesis restarts mid-task, the session state survives. This is SQLite + JSON, not model memory.
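A snapshot-on-transition store built on SQLite and JSON might look roughly like the sketch below. The table schema, class name, and method names are assumptions for illustration; they are not the ForgeSession implementation.

```python
import json
import sqlite3
from typing import Any, Optional


class SessionStore:
    """Persists the latest snapshot of each session to a SQLite file."""

    def __init__(self, path: str) -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS snapshots ("
            "session_id TEXT PRIMARY KEY, state TEXT NOT NULL, payload TEXT NOT NULL)"
        )
        self._db.commit()

    def transition(self, session_id: str, new_state: str, payload: dict[str, Any]) -> None:
        # The connection as a context manager commits on success, rolls back on error.
        with self._db:
            self._db.execute(
                "INSERT OR REPLACE INTO snapshots (session_id, state, payload) VALUES (?, ?, ?)",
                (session_id, new_state, json.dumps(payload)),
            )

    def load(self, session_id: str) -> Optional[tuple[str, dict[str, Any]]]:
        row = self._db.execute(
            "SELECT state, payload FROM snapshots WHERE session_id = ?", (session_id,)
        ).fetchone()
        return None if row is None else (row[0], json.loads(row[1]))

    def close(self) -> None:
        self._db.close()
```

After a restart, a new `SessionStore` opened on the same file returns the last committed state, which is what lets a task resume without relying on anything the model remembers.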
Audit Trail
The unified audit endpoint cross-references 4 sources (Chronicle, SafetyGates, Escalations, Forge sessions) to reconstruct what any agent did. This is structured data, not model-dependent recall.
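Cross-referencing several event sources usually comes down to merging time-ordered streams. The sketch below assumes each source yields events already sorted by timestamp and uses a hypothetical `AuditEvent` shape; the real endpoint's schema is not described in this post.

```python
import heapq
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AuditEvent:
    timestamp: float
    source: str
    agent_id: str
    detail: str


def agent_timeline(agent_id: str, *sources: Iterable[AuditEvent]) -> list[AuditEvent]:
    """Merge per-source event streams (each pre-sorted by timestamp) for one agent."""
    merged = heapq.merge(*sources, key=lambda event: event.timestamp)
    return [event for event in merged if event.agent_id == agent_id]
```

Every field here is structured data written by the system at the time of the action, so the reconstruction never asks a model what it did.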
The Actual Test
We run AitherOS with three backends simultaneously:
- Local: aither-orchestrator (Nemotron-Orchestrator-8B fine-tune) on a 5090 via vLLM (effort 1-6)
- DGX Spark: Qwen3-27B for reasoning tasks (effort 7-10)
- Cloud: Claude/GPT as fallback
Swapping the local model from aither-orchestrator to Gemma to LLaMA requires:
- Change config/model_tiers.yaml: add the new model name
- Update the vLLM model path
- Restart MicroScheduler
That's it. No code changes. The 12-stage context pipeline, safety gates, capability tokens, memory reinforcement, escalation system, and error learning all work identically.
What We Learned
The gap isn't in design — it's in wiring. We had SafetyGates for two months before they were actually connected to the execution path. We had MemoryReinforcement but no outcome-based decay. We had ForgeSession but it was in-memory only.
The boring infrastructure — permission checks before every tool call, session snapshots on state transitions, error pattern databases, fact evolution chains — is what makes "replace the model tomorrow" actually work.
Real model independence isn't about abstracting the API call. It's about making sure everything around the LLM is deterministic, persistent, and model-agnostic.