Tags: architecture, llm, model-independence, engineering

What Breaks When You Swap Your LLM

Name: AitherOS
Author: Aitherium

May 25, 2026 · 5 min read · Aitherium

What Breaks When You Swap Your LLM — And What Doesn't

The "replace the model tomorrow" test is the quickest way to tell a real agent system from a wrapper. If your agent breaks when you change from GPT-4 to Claude to a local model, you don't have an agent — you have a prompt chain with extra steps.

We ran this test on AitherOS. Here's the honest breakdown.

What Changes (By Design)

1. Context Scaffolding Intensity

Stronger models need less hand-holding. AitherOS's NeuronScaler adjusts automatically:

Model Tier	Neuron Multiplier	Recursive Refinement
Frontier (Claude Opus, GPT-4o)	0.6x	Skipped
Strong (Sonnet, Qwen-70B)	0.8x	Enabled
Standard (Haiku, Qwen-30B)	1.0x	Enabled
Weak (Qwen-4B, Phi-3)	1.3x	Enabled

This mapping now lives in config/model_tiers.yaml — add a new model without touching code.
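
As a rough illustration of how a tier table like this can drive scaling, here is a minimal Python sketch. The multipliers and refinement flags are copied from the table above; the dictionary layout, the model-name keys, the fallback to the standard tier, and the `scaled_neuron_count` helper are all assumptions made for the example, not the actual schema of config/model_tiers.yaml or the NeuronScaler API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierPolicy:
    neuron_multiplier: float
    recursive_refinement: bool


# Values taken from the table above. The dict layout is an assumption,
# not the real schema of config/model_tiers.yaml.
TIER_POLICIES = {
    "frontier": TierPolicy(0.6, False),
    "strong": TierPolicy(0.8, True),
    "standard": TierPolicy(1.0, True),
    "weak": TierPolicy(1.3, True),
}

# Hypothetical model-name to tier mapping, for illustration only.
MODEL_TIERS = {
    "claude-opus": "frontier",
    "qwen-70b": "strong",
    "qwen-30b": "standard",
    "phi-3": "weak",
}


def policy_for(model: str) -> TierPolicy:
    """Look up the tier policy; unknown models fall back to 'standard' (assumption)."""
    return TIER_POLICIES[MODEL_TIERS.get(model, "standard")]


def scaled_neuron_count(model: str, base_neurons: int) -> int:
    """Scale a base scaffolding budget by the model's tier multiplier."""
    return round(base_neurons * policy_for(model).neuron_multiplier)
```

With this shape, supporting a new model is a one-line data change: add its name to the mapping and the scaling follows from the tier.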

2. Tool Call Format

Different models emit tool calls differently. Our ToolCallParser handles 5 formats transparently:

OpenAI-style JSON tool_calls array
Anthropic XML <tool_call> tags
Hermes/ChatML format
YAML-in-markdown blocks
Plaintext function signatures

The agent runtime doesn't care which format the model uses.
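
To make the idea concrete, the sketch below normalizes two of the five formats into one internal structure. It is not the ToolCallParser itself: the `ToolCall` dataclass, the `parse_tool_calls` function, and the assumption that the `<tool_call>` tag wraps a JSON object with `name` and `arguments` keys are all illustrative choices.

```python
import json
import re
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    arguments: dict


# Assumed tag payload: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
_TAG_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(message: dict) -> list[ToolCall]:
    """Normalize a model message into a list of ToolCall objects."""
    calls = []

    # Format 1: OpenAI-style tool_calls array, where arguments is a JSON string.
    for item in message.get("tool_calls") or []:
        fn = item["function"]
        calls.append(ToolCall(fn["name"], json.loads(fn["arguments"])))
    if calls:
        return calls

    # Format 2: tagged JSON embedded in the text content.
    for match in _TAG_RE.finditer(message.get("content") or ""):
        payload = json.loads(match.group(1))
        calls.append(ToolCall(payload["name"], payload.get("arguments", {})))
    return calls
```

Everything downstream of the parser only ever sees `ToolCall` objects, which is what lets the runtime ignore the wire format.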

3. Effort-Based Routing

Effort levels 1-3 route to small models, 4-6 to the orchestrator model, 7-10 to reasoning models. When you swap models, the pool changes but the routing logic doesn't.
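
A minimal sketch of that split between routing logic and model pool, assuming a simple dictionary for the pool. The effort bands come from the paragraph above; the function names and the `small-local-model` entry are hypothetical.

```python
def pool_for_effort(effort: int) -> str:
    """Map an effort level (1-10) to a pool name. This logic never changes."""
    if not 1 <= effort <= 10:
        raise ValueError(f"effort must be between 1 and 10, got {effort}")
    if effort <= 3:
        return "small"
    if effort <= 6:
        return "orchestrator"
    return "reasoning"


# Swapping models means editing this mapping only.
# "small-local-model" is a placeholder name, not a real deployment.
MODEL_POOL = {
    "small": "small-local-model",
    "orchestrator": "aither-orchestrator",
    "reasoning": "Qwen3-27B",
}


def model_for_effort(effort: int, pool: dict = MODEL_POOL) -> str:
    """Resolve the concrete model for an effort level from the given pool."""
    return pool[pool_for_effort(effort)]
```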

What Doesn't Change (The Hard Part)

Safety Gates

Permission tiers (OBSERVE through HUMAN_REQUIRED) are enforced before the LLM sees the tool. The model can't bypass SafetyGates by formatting a tool call differently. Gate checks happen in ActionExecutor and AgentRuntime._execute_single_tool() — model-independent.
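
The following sketch shows the general shape of a gate that runs in the runtime rather than in the prompt. Only OBSERVE and HUMAN_REQUIRED are named in this post; the intermediate `MODIFY` tier, the tool-to-tier table, and the function signature are assumptions for the example and are not the real `_execute_single_tool()` code.

```python
from enum import IntEnum


class PermissionTier(IntEnum):
    OBSERVE = 0
    MODIFY = 1          # hypothetical intermediate tier
    HUMAN_REQUIRED = 2


class GateDenied(Exception):
    """Raised when a tool call exceeds the agent's granted tier."""


# Hypothetical tool-to-tier table.
TOOL_TIERS = {
    "read_file": PermissionTier.OBSERVE,
    "delete_repo": PermissionTier.HUMAN_REQUIRED,
}


def execute_single_tool(name: str, args: dict, granted: PermissionTier, tools: dict):
    """Check the gate first, then run the tool. The parsed call format is irrelevant here."""
    # Unknown tools get the strictest tier (assumption: fail closed).
    required = TOOL_TIERS.get(name, PermissionTier.HUMAN_REQUIRED)
    if required > granted:
        raise GateDenied(f"{name} requires {required.name}, agent has {granted.name}")
    return tools[name](**args)
```

Because the check takes a tool name and a granted tier, nothing the model writes in its output can influence the decision.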

Capability Tokens

Capability tokens are HMAC-SHA256 signed and issued per agent. The kernel generates them, not the model. The policy is default-deny: an agent can only use tools for which it holds an explicit capability grant.
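
Here is a minimal sketch of signing and checking such a token with Python's standard `hmac` module. The token layout (a JSON payload, a dot, then a hex signature), the claim names, and both function names are assumptions for illustration; the real token format is not described in this post.

```python
import hashlib
import hmac
import json


def issue_token(secret: bytes, agent_id: str, tools: list[str]) -> str:
    """Kernel side: sign the agent's grants with HMAC-SHA256."""
    payload = json.dumps({"agent": agent_id, "tools": sorted(tools)}, sort_keys=True)
    signature = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{signature}"


def is_allowed(secret: bytes, token: str, agent_id: str, tool: str) -> bool:
    """Default-deny check: valid signature, matching agent, explicit grant."""
    # The hex signature contains no dots, so split on the last one.
    payload, _, signature = token.rpartition(".")
    expected = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking signature prefixes.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return False
    claims = json.loads(payload)
    return claims["agent"] == agent_id and tool in claims["tools"]
```

A model that rewrites the payload to add a tool cannot produce a matching signature without the kernel's secret, so the edited token is rejected.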

Memory System

The 6-tier memory pipeline (ephemeral through identity) doesn't depend on model capabilities. Outcome-based decay penalizes memories that led to failures regardless of which model was running. SUPERSEDES chains track fact evolution independent of which model updated the fact.
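
One way to picture outcome-based decay and SUPERSEDES chains is the sketch below. The `Memory` fields, the penalty and boost factors, and the helper names are invented for the example; the actual decay formula and storage layout are not given in this post.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Memory:
    id: str
    text: str
    strength: float = 1.0
    supersedes: Optional[str] = None  # id of the fact this one replaces


def apply_outcome(memory: Memory, success: bool,
                  penalty: float = 0.5, boost: float = 1.1) -> None:
    """Adjust strength from the task outcome. No model name appears anywhere."""
    factor = boost if success else penalty  # hypothetical factors
    memory.strength = min(1.0, memory.strength * factor)


def fact_history(store: dict, memory_id: str) -> list[str]:
    """Walk the SUPERSEDES chain from the newest fact back to the oldest."""
    chain, seen = [], set()
    current = store.get(memory_id)
    while current is not None and current.id not in seen:
        seen.add(current.id)  # guard against accidental cycles
        chain.append(current.text)
        current = store.get(current.supersedes) if current.supersedes else None
    return chain
```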

Escalation System

When an agent calls escalate_to_human, the escalation goes through EscalationDispatcher (Redis-backed, DM + email notifications). The approval/deny flow is entirely model-independent — it's a state machine, not a prompt.
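
The "state machine, not a prompt" point can be shown in a few lines. The state names, event names, and transition table below are assumptions for illustration and do not describe the EscalationDispatcher's real states.

```python
from enum import Enum


class EscalationState(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    DENIED = "denied"


# Hypothetical transition table: only a pending escalation can be resolved.
_TRANSITIONS = {
    (EscalationState.PENDING, "approve"): EscalationState.APPROVED,
    (EscalationState.PENDING, "deny"): EscalationState.DENIED,
}


def transition(state: EscalationState, event: str) -> EscalationState:
    """Apply a human decision; anything outside the table is rejected."""
    try:
        return _TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"invalid transition: {state.name} on {event!r}") from None
```

Model output never appears as an input to `transition`, so no phrasing from the model can move an escalation to approved.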

Error Learning

The ErrorLearningStore records tool failure patterns and recovery strategies. When tool X fails with error Y, the recovery guidance is injected into the retry context regardless of which model is reasoning about it.
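
A small sketch of the pattern, using an in-memory stand-in. The `ErrorPatternStore` class, its method names, and the example tool and error names are hypothetical; the real ErrorLearningStore interface is not shown in this post.

```python
class ErrorPatternStore:
    """In-memory stand-in that maps (tool, error) pairs to recovery guidance."""

    def __init__(self) -> None:
        self._guidance: dict[tuple[str, str], list[str]] = {}

    def record(self, tool: str, error: str, guidance: str) -> None:
        """Remember a recovery strategy for a given failure pattern."""
        self._guidance.setdefault((tool, error), []).append(guidance)

    def retry_context(self, tool: str, error: str, base_context: list[str]) -> list[str]:
        """Return the retry context with any known guidance appended."""
        hints = self._guidance.get((tool, error), [])
        return base_context + [f"Recovery hint: {hint}" for hint in hints]
```

The lookup key is the tool and the error, so whichever model handles the retry receives the same hints.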

Task State

ForgeSession snapshots persist to disk on every state transition. If Genesis restarts mid-task, the session state survives. This is SQLite + JSON, not model memory.
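
Snapshot-on-transition with SQLite and JSON can be sketched with the standard library alone. The table schema and function names here are assumptions, not the ForgeSession implementation.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the snapshot database. Use a file path to survive restarts."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snapshots ("
        "session_id TEXT PRIMARY KEY, state TEXT NOT NULL, data TEXT NOT NULL)"
    )
    return conn


def save_snapshot(conn: sqlite3.Connection, session_id: str, state: str, data: dict) -> None:
    """Persist the latest snapshot; call this on every state transition."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO snapshots VALUES (?, ?, ?)",
            (session_id, state, json.dumps(data)),
        )


def load_snapshot(conn: sqlite3.Connection, session_id: str):
    """Return (state, data) for a session, or None if it was never saved."""
    row = conn.execute(
        "SELECT state, data FROM snapshots WHERE session_id = ?", (session_id,)
    ).fetchone()
    return None if row is None else (row[0], json.loads(row[1]))
```

After a restart, a process that reopens the same database file can call `load_snapshot` and resume from the last recorded transition.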

Audit Trail

The unified audit endpoint cross-references 4 sources (Chronicle, SafetyGates, Escalations, Forge sessions) to reconstruct what any agent did. This is structured data, not model-dependent recall.
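
As a sketch of what cross-referencing structured sources can look like, the function below merges per-source event lists into one timeline for a single agent. The event fields (`agent`, `ts`, `action`) and the function name are assumptions; the real audit endpoint's schema is not described here.

```python
def reconstruct_timeline(agent_id: str, sources: dict[str, list[dict]]) -> list[dict]:
    """Merge events from every source into one time-ordered list for an agent."""
    timeline = [
        {**event, "source": name}  # tag each event with where it came from
        for name, events in sources.items()
        for event in events
        if event["agent"] == agent_id
    ]
    return sorted(timeline, key=lambda event: event["ts"])
```

For example, passing four lists keyed as `chronicle`, `safety_gates`, `escalations`, and `forge_sessions` yields a single ordered record of what the agent did, built entirely from stored data.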

The Actual Test

We run AitherOS with three backends simultaneously:

Local: aither-orchestrator (Nemotron-Orchestrator-8B fine-tune) on an RTX 5090 via vLLM (effort 1-6)
DGX Spark: Qwen3-27B for reasoning tasks (effort 7-10)
Cloud: Claude/GPT as fallback

Swapping the local model from aither-orchestrator to Gemma or LLaMA requires three steps:

Change config/model_tiers.yaml — add the new model name
Update the vLLM model path
Restart MicroScheduler

That's it. No code changes. The 12-stage context pipeline, safety gates, capability tokens, memory reinforcement, escalation system, and error learning all work identically.

What We Learned

The gap isn't in design — it's in wiring. We had SafetyGates for two months before they were actually connected to the execution path. We had MemoryReinforcement but no outcome-based decay. We had ForgeSession but it was in-memory only.

The boring infrastructure — permission checks before every tool call, session snapshots on state transitions, error pattern databases, fact evolution chains — is what makes "replace the model tomorrow" actually work.

Real model independence isn't about abstracting the API call. It's about making sure everything around the LLM is deterministic, persistent, and model-agnostic.
