Anatomy of a Hallucination Fix: How AitherOS Makes LLMs Check Their Own Work
Ask any LLM — GPT-4, Claude, Gemini, Llama, doesn't matter — this question:
"Which month of the year is spelled with an X?"
And it will tell you "October." Confidently. No hesitation.
O-C-T-O-B-E-R. Count the letters. There's no X.
We call this the October bug. Not because it's an October problem, but because October is the answer every model gives. It's a perfect example of LLM hallucination: the model doesn't lack knowledge (it knows the alphabet, it knows how to spell October), it just never checks. The pattern-matching shortcut in the transformer says "month + unusual letter = October" and the model runs with it.
Even with chain-of-thought prompting, most models skip the actual verification step. They'll write "Let me think about this..." and then jump straight to "October contains the letter X" without ever spelling it out.
We fixed this. Not with a bigger model, not with fine-tuning, not with a lookup table. We fixed it by building a system that physically prevents the model from answering factual questions without verification.
Here's how it works.
The pipeline: what happens when you ask a question
When a message arrives in AitherOS, it passes through a pipeline of components before the LLM ever sees it. Each component makes a decision, and those decisions compound into an execution plan that controls exactly how the model is allowed to respond.
Here's the full path for "which month is spelled with an X?":
Step 1: Intent classification
IntentEngine.classify() analyzes the query. This is a local classifier — no LLM call, sub-millisecond — that determines what kind of request this is. It returns:
- effort: 3 (factual question, not trivial, not research-level)
- intent:
question - depth:
BALANCED
This is important: the intent engine doesn't try to answer the question. It just categorizes it so the rest of the pipeline knows what kind of processing to apply.
Step 2: SASE pattern matching
The CriticalityGate — part of our SASE (Surgical Augmented Selective Execution) framework — runs a bank of regex patterns against the query. SASE is a four-phase cognitive loop: Situation (gather context), Analysis (break down the problem), Synthesis (combine insights), Execution (act and validate). The CriticalityGate decides whether a query needs the full SASE cycle or can skip straight to execution. These patterns identify questions where LLMs are known to hallucinate confidently.
In lib/cognitive/SASEIntegration.py, the FORCE_REASONING_PATTERNS list includes:
# Spelling/counting/letter questions (model hallucinates confidently)
r"\bspelled?\s+with\b",
r"\bhow\s+many\s+(letters|vowels|consonants|syllables)\b",
r"\bcontains?\s+(the\s+)?(letter|character|digit)\b",
Our query matches \bspelled?\s+with\b. The gate returns a CriticalityDecision:
CriticalityDecision(
should_reason=True,
depth=ReasoningDepth.STANDARD,
criticality_score=0.6,
trigger_pattern=r"\bspelled?\s+with\b",
)
This decision propagates downstream. The model must reason about this question — it's not optional.
Step 3: RuntimeConfig gets force_include_tools
In ChatEngine, when SASE engages, the reason tool gets force-included:
if _sase_decision and _sase_decision.should_reason:
config.force_include_tools = ["reason"]
This is the critical wiring point. Without this line, the SASE decision would fire but nothing downstream would know about it. The force_include_tools field on RuntimeConfig tells ToolGraph to guarantee these tools are available, and tells the BudgetToken to enforce their use.
Step 4: ToolGraph preselects tools
AitherOS uses a ToolGraph system that selects which tools to present to the model for each request. Presenting all 100+ tools would waste context window space and confuse the model. Instead, ToolGraph selects a relevant subset.
But some tools should always be available. The reason tool is in the _ESSENTIAL_TOOLS set:
_ESSENTIAL_TOOLS = {
"read_file", "write_file", "replace_in_file",
"search_files", "list_directory", "code_search",
"reason"
}
Even if ToolGraph's selection algorithm doesn't pick reason, the essential tools gate adds it back. Combined with force_include_tools, the model will always see the reason tool for this request.
Step 5: BudgetToken enforcement
Here's where the magic happens. The RuntimeConfig sets min_tool_calls=1 when force_include_tools is populated. This means the ReAct loop physically cannot finish until at least one tool has been called.
The BudgetToken tracks this:
@property
def can_finish(self) -> bool:
return (
self.turns_used >= self.min_turns
and self.tool_calls_used >= self.min_tool_calls
)
If the model tries to produce a final answer without calling a tool, can_finish returns False.
Step 6: The nudge
When the model tries to finish early (which it will — its instinct is to just say "October"), the ReAct loop intercepts the response and injects a nudge:
if _tools_remaining > 0:
if "reason" in _force_tools:
nudge = (
"You MUST call the `reason` tool IMMEDIATELY with the "
"user's question as the argument. Do NOT call tool_search, "
"web_search, or any other tool first. The reason tool will "
"help you verify your answer step by step."
)
This nudge is injected as a system message with [REASONING BUDGET] prefix. The model sees it, understands it's being asked to verify, and calls the reason tool.
Step 7: The reason tool fires
The reason tool is not just another LLM call. It has a carefully crafted system prompt in lib/orchestration/conversational_tools.py:
_sys_content = (
"You are a critical reasoning engine. Your ONLY job is to find the "
"CORRECT answer through rigorous step-by-step verification.\n\n"
"MANDATORY PROCESS:\n"
"1. Restate exactly what is being asked — no assumptions\n"
"2. For factual claims: VERIFY by enumeration (spell out words letter "
"by letter, list all items, count explicitly)\n"
"3. Check if the 'obvious' answer actually satisfies the question\n"
"4. If you find the obvious answer is WRONG, say so clearly\n"
"5. State your verified conclusion with the evidence\n\n"
"CRITICAL: Do NOT trust your intuition. Do NOT skip verification "
"steps. The user is asking because the first model already got this "
"wrong."
)
Two things make this work:
-
"Spell out words letter by letter" — Forces the model to enumerate rather than pattern-match. When you spell O-C-T-O-B-E-R, there's no X. The model sees this.
-
"The first model already got this wrong" — Primes the reasoning model to be skeptical of the obvious answer. Without this, the reasoning model would make the same mistake.
The reason tool routes through LLMGateway with tier="reasoning", which dispatches to the reasoning model (DeepSeek-R1 or Gemma 4, depending on what's loaded). The reasoning model enumerates all twelve months, spells each one out, and concludes: no month in the Gregorian calendar contains the letter X.
Correct answer. Verified answer.
The architecture diagram
Here's the full flow:
User: "Which month is spelled with an X?"
│
▼
IntentEngine.classify()
│ effort=3, intent="question"
▼
CriticalityGate.assess()
│ pattern match: \bspelled?\s+with\b
│ → should_reason=True
▼
ChatEngine._process_agentic()
│ config.force_include_tools = ["reason"]
│ config.min_tool_calls = 1
▼
ToolGraph.preselect()
│ reason ∈ _ESSENTIAL_TOOLS → always included
▼
AgentRuntime ReAct Loop
│
├─ Turn 1: Model tries "October"
│ └─ BudgetToken: can_finish=False (0/1 tools used)
│ └─ Nudge: "Call reason tool IMMEDIATELY"
│
├─ Turn 2: Model calls reason("which month is spelled with an X?")
│ └─ _execute_reason() fires
│ └─ System prompt: "spell out words letter by letter"
│ └─ LLMGateway → reasoning model
│ └─ Enumerates all 12 months → no X found
│ └─ Returns: "No month contains the letter X"
│
└─ Turn 3: Model incorporates reasoning result
└─ BudgetToken: can_finish=True (1/1 tools used)
└─ Final answer: "No month of the year is spelled with an X"
The wiring problem
Here's the thing that took us the longest to figure out: every single component in this pipeline worked correctly in isolation. The tests all passed. The reason tool, when called directly, gave the right answer every time. The SASE patterns matched the right queries. The BudgetToken enforced tool minimums perfectly.
The bug was always in the wiring — the connections between components.
Wire 1: SASE patterns were too narrow
The original patterns caught "which" but not "what." So "which month is spelled with an X?" triggered reasoning, but "what month has an X in it?" didn't. We expanded the pattern set to cover the full family of factual verification queries:
r"\bwhich\s+\w+\s+(is|has|was|are|were|does)\b",
r"\bwhat\s+\w+\s+(is|has|was|are|were|does|in)\b",
Wire 2: The reason tool wasn't registered
The reason tool was defined in conversational_tools.py as a schema and an executor function. But it wasn't registered in the ToolRegistry that ToolGraph reads from. The tool existed in code but was invisible to the system. Adding it to _ESSENTIAL_TOOLS and ensuring it appeared in the tool schema for every agentic request fixed this.
Wire 3: deep_reasoning vs. _execute_reason
There were two code paths for "make the model think harder":
deep_reasoning— dispatched through AgentForge, which used a generic agent prompt_execute_reason— a direct LLM call with the verification-focused system prompt
The deep_reasoning path sounded fancier but used the wrong prompt. The model would "reason deeply" and still conclude "October" because nobody told it to spell things out. We unified on _execute_reason with its verification prompt.
Wire 4: python_repl swallowed results
When the reasoning model tried to verify programmatically — writing Python like [m for m in months if 'x' in m.lower()] — the REPL returned nothing visible. The expression evaluated to [] (empty list), but bare expressions in Python don't produce output. The model saw empty output and assumed its code was wrong.
We fixed this with AST-based expression detection in ActionExecutor.py:
_tree = _ast.parse(code)
if _tree.body and isinstance(_tree.body[-1], _ast.Expr):
_last_src = _ast.get_source_segment(code, _last_node)
code = (
_prefix
+ f"_expr_result_ = {_last_src}\n"
+ "if _expr_result_ is not None: print(repr(_expr_result_))"
)
Now [m for m in months if 'x' in m.lower()] outputs [], which the model correctly interprets as "no months contain X."
Wire 5: force_include_tools wasn't propagated
The SASE decision set should_reason=True, but ChatEngine didn't read that decision and set force_include_tools on the RuntimeConfig. The decision fired, got logged, and was forgotten. One line of code:
if _sase_decision and _sase_decision.should_reason:
config.force_include_tools = ["reason"]
Five disconnected wires. Each one individually would have been a minor bug — "reason tool not always available," "REPL output sometimes empty." But together they created a system where the model could confidently hallucinate on every factual verification question.
The key design decisions
Why reason is an essential tool
We could have made the reason tool only appear when SASE triggers. But that creates a chicken-and-egg problem: what if the pattern list doesn't cover a new category of hallucination-prone questions?
By making reason an essential tool — always available, like read_file — we give the model the option to verify itself even when SASE doesn't trigger. The model is trained to recognize when it's uncertain, and having the tool available means it can self-correct without needing a pattern match.
SASE is the safety net. The essential tool is the first line of defense.
Why verification prompts mention "the first model got this wrong"
Without this priming, the reasoning model makes the same mistakes as the orchestrator. Both are LLMs, both have the same pattern-matching shortcuts. Telling the reasoning model "the first model already got this wrong" activates a different behavior: skepticism of the obvious answer.
This is a form of adversarial prompting — not against the user, but against the model's own biases.
Why BudgetToken is a hard gate, not a suggestion
Early versions used a softer approach: "We recommend you use the reason tool." The model ignored it 60% of the time. LLMs are optimized to produce answers quickly. Given a choice between "think carefully" and "just answer," they'll answer.
The BudgetToken removes the choice. can_finish returns False until the tool minimum is met. The ReAct loop literally cannot produce a final answer. This isn't a prompt technique — it's a control flow constraint.
What it looks like in practice
Before the fix:
User: Which month is spelled with an X?
Aither: October! The month of October contains the letter X.
After the fix:
User: Which month is spelled with an X?
Aither: [calls reason tool]
→ Reasoning: Let me spell out each month...
→ January: J-A-N-U-A-R-Y (no X)
→ February: F-E-B-R-U-A-R-Y (no X)
→ March: M-A-R-C-H (no X)
→ ... [enumerates all 12] ...
→ December: D-E-C-E-M-B-E-R (no X)
→ Conclusion: No month contains the letter X.
Aither: No month of the year is spelled with the letter X.
I checked each one: January through December, none of them
contain an X.
The answer takes 3-4 seconds longer. That's the cost of verification. We think it's worth it.
MCTS verification: multi-strategy branching
The pipeline above is linear: detect → force tool → verify → answer. It works for single-verification questions like the October bug. But some problems need branching exploration.
Consider: "Which US state capitals have more vowels than consonants?" This requires checking 50 capitals, and the verification tree branches at each one. A single ReAct loop might time out or lose track.
We built MCTS (Monte Carlo Tree Search) into the reasoning pipeline. When the CriticalityGate classifies a question as hallucination-prone (spelling, counting, letter containment), it sets the SASE depth to sase_verify. At this depth, _execute_reason() doesn't run a single LLM call — it runs MCTS on a ReasoningVerificationEnv to select the optimal combination of verification strategies.
17 strategies across 5 tiers
The original env had 5 prompt-only strategies. The current version has 17, organized into tiers that compose real AitherOS subsystems:
| Tier | Strategy | What it does | Cost |
|---|---|---|---|
| 1 | chain_of_thought | Standard step-by-step reasoning | 0.1 |
| 1 | enumeration | Spell out every item, list everything explicitly | 0.1 |
| 1 | adversarial_check | Assume the obvious answer is wrong, find evidence | 0.1 |
| 1 | decomposition | Break into sub-problems, solve independently | 0.1 |
| 1 | programmatic | Write Python code to verify computationally | 0.1 |
| 2 | tool_grounded_reason | Mini ReAct loop with web_search, read_file, query_memory | 0.3 |
| 2 | code_act_verify | CodeActEngine: generate + sandbox-execute verification code | 0.3 |
| 2 | web_cross_check | Search the web for factual claims, compare sources | 0.25 |
| 3 | council_review | 6 specialist reviewers score the answer in parallel | 0.6 |
| 3 | prism_rotate | PrismStrategy: re-approach from rotated cognitive persona | 0.4 |
| 3 | swarm_verify | SwarmCodingEngine plan_only: ARCHITECT decomposes, JUDGE scores | 0.8 |
| 3 | agent_specialist | Forge-dispatch to domain expert (lyra, athena, atlas...) | 0.5 |
| 4 | frontier_judge | FrontierJudge (Claude Opus) review with cost guards | 1.0 |
| 4 | judge_engine | JudgeEngine.evaluate_quality() with criteria from quality.yaml | 0.4 |
| 4 | routing_judge_verify | Fast local validation via RoutingJudge (Gemma 4b) | 0.15 |
| 5 | consensus_vote | Run 3 Tier-1 strategies, majority vote | 0.35 |
| 5 | escalation_cascade | Start cheap (Tier 1), escalate if confidence is low | 0.5 |
Tier 1 strategies are prompt templates — always available, very cheap. Tier 2+ strategies dispatch to real AitherOS subsystems: the tool-calling ReAct loop, the CodeActEngine sandbox, the 6-specialist CouncilReview, the SwarmCodingEngine's ARCHITECT→JUDGE pipeline, AgentForge dispatch to specialist agents, and the FrontierJudge quality gate. Each executor returns a standardized result with answer, confidence, evidence, and cost incurred.
Cost-aware planning
The env has a cost budget that flows from the EffortScaler. Low-effort questions (effort 1-2) get a budget of 0.3 — only Tier 1 strategies fit. High-effort questions (effort 9-10) get a budget of 2.0 — enough for a FrontierJudge call. MCTS factors cost into the reward function: it penalizes plans that blow the budget and rewards cost-efficient verification.
Service availability filtering runs at plan time. If Genesis isn't running, Tier 2-4 strategies that require it are filtered out before MCTS even starts. The system gracefully degrades to prompt-only verification — still better than a single LLM call.
Scoring
MCTS scores strategy combinations on six axes: consistency potential (do different strategy types agree?), enumeration completeness, adversarial coverage, strategy diversity, tier diversity (mixing tiers is more valuable than same tier), and cost efficiency. For the October bug, MCTS consistently selects [enumeration, adversarial_check] — spell out every month, then challenge the obvious answer.
After selection, the strategies execute in parallel. Each one dispatches to its real subsystem (or falls back to prompt-only if the subsystem is unavailable). Their answers are then cross-checked by a synthesis pass that explicitly trusts enumeration and programmatic strategies over chain-of-thought when they disagree:
User: "Which month is spelled with an X?"
│
▼
CriticalityGate → sase_verify (spelling pattern matched)
│
▼
EffortScaler → cost_budget=1.2 (effort 7)
│
▼
Service probe → Genesis available → Tier 2-4 unlocked
│
▼
MCTS selects: [enumeration, adversarial_check, code_act_verify]
│
├─ enumeration (T1): spells all 12 months → no X found
├─ adversarial_check (T1): assumes "October" is wrong → confirms no X
└─ code_act_verify (T2): CodeActEngine runs sandbox Python → []
│
▼
Synthesis: all 3 strategies agree → "No month contains the letter X"
│ confidence: high (3/3 consensus, cross-tier)
▼
Final answer with multi-strategy verification tag
The MCTS search runs in ~100 iterations, under 3 seconds. The strategy execution adds the cost of 2-3 parallel calls — but these run concurrently, so the total latency is roughly the time of one call plus synthesis. For a question where getting it wrong would be embarrassing, that's a good trade.
The October bug taught us something fundamental: LLMs don't hallucinate because they're stupid. They hallucinate because they're lazy. They take shortcuts because shortcuts are usually right, and the training objective rewards being right quickly over being right carefully.
The fix isn't smarter models. It's systems that won't let the model skip the work.