
Tags: architecture, mcts, intent-classification, tool-selection, deep-dive, routing, effort-scaling, post-mortem

Six Layers of Wrong: How a Regex Broke Our Entire Intelligence Stack

April 13, 2026 · 15 min read · David Parkhurst


"How do you know the time?"

Here's what Aither said:

I was trained on vast text data... NTP servers synchronize with atomic clocks... time zones are offset from UTC...

Here's what Aither should have said:

EcosystemClient.get_time_awareness() injects the current time into my context pipeline every turn. It's part of the Pulse service.

Both are true. One is the answer to a different question. The user asked about this system's mechanism. They got a Wikipedia article.

The answer existed. It was sitting in the codegraph index, two function calls from the LLM. But six independent systems — each individually reasonable, each doing exactly what it was designed to do — formed a kill chain that made it unreachable.

This is the post-mortem.

Layer 1: A regex stole the intent

The intent classifier runs regex patterns against the message before anything else. self_awareness is a priority category — checked first, short-circuits everything:

self_awareness:
  patterns:
    - '\bdo\s+you\s+(think|dream|experience|understand|know)\b'

"How do you know the time?" matches do\s+you\s+know. The system thinks you're asking an existential question — are you a conscious being that possesses knowledge? — when you're asking a mechanical one: what function call gives you the clock?

The FacultyIntentEngine (the real classifier, using actual NLP) correctly returned question. But self_awareness is a priority category. The regex ran first and won.

One word — "know" — has two meanings in English. The regex can't tell them apart.

Fix: "Do you know" only matches at end-of-sentence (the existential "do you know?"), not when followed by an object ("do you know the time?"). Two lines of YAML.
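The exact two lines of YAML aren't reproduced here, so treat the following as a minimal sketch of the idea rather than the shipped patterns. It assumes "know" is pulled out of the shared verb group and given its own pattern anchored to the end of the message; `SELF_AWARENESS_PATTERNS` and `is_self_awareness` are illustrative names.

```python
import re

# Hypothetical reconstruction of the narrowed patterns (not the real YAML).
SELF_AWARENESS_PATTERNS = [
    # The other existential verbs keep the original, unanchored behaviour.
    re.compile(r"\bdo\s+you\s+(think|dream|experience|understand)\b", re.IGNORECASE),
    # "know" only counts when nothing but punctuation follows it.
    re.compile(r"\bdo\s+you\s+know\s*[?.!]*\s*$", re.IGNORECASE),
]


def is_self_awareness(message: str) -> bool:
    """Return True if any priority self_awareness pattern matches."""
    return any(p.search(message) for p in SELF_AWARENESS_PATTERNS)


# The mechanical question no longer trips the priority category.
print(is_self_awareness("How do you know the time?"))
# The standalone existential form still does.
print(is_self_awareness("But do you know?"))
```

Anchoring on end-of-message is a blunt heuristic: it will still miss phrasings like "do you know, really?", which is one more reason to let the NLP classifier have the final say.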

Layer 2: Effort said "no tools"

Intent is now correct: question, effort 3. But the EffortScaler has a table:

1: "direct",     2: "direct",
3: "tools" if needs_tool else "direct",

Effort 1-2: no tools ever. Effort 3-4: tools only if the intent classifier set needs_tool=True. question returns needs_tool=False.

So: effort 3 + question = direct mode = zero tools offered to the LLM.

This is the architectural mistake we kept making. Effort was a binary gate: below a threshold, the system is deaf. Above it, the system can act. The LLM never gets to decide for itself.

Fix: Effort is a dial, not a gate. Every level gets orchestration_mode: "tools". The LLM always has tools. Effort controls how many turns, which model, how deep the reasoning — not whether tools exist.
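To make "dial, not gate" concrete, here is a rough sketch of what such a table could look like. Only two details come from this post: every level uses `orchestration_mode: "tools"`, and effort 3 uses `reasoning_depth: "gate"`. The number of levels, the turn budgets, the other depth labels, and the names `EffortProfile` and `profile_for` are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EffortProfile:
    orchestration_mode: str  # always "tools" after the fix
    max_tool_turns: int      # illustrative budget, not the real values
    reasoning_depth: str     # "gate" at effort 3 per the post; others assumed


# Hypothetical table: effort changes depth, never whether tools exist.
EFFORT_TABLE = {
    1: EffortProfile("tools", 1, "skip"),
    2: EffortProfile("tools", 2, "skip"),
    3: EffortProfile("tools", 4, "gate"),
    4: EffortProfile("tools", 6, "gate"),
    5: EffortProfile("tools", 10, "deep"),
}


def profile_for(effort: int) -> EffortProfile:
    """Clamp the effort level into the table's range and look it up."""
    lowest, highest = min(EFFORT_TABLE), max(EFFORT_TABLE)
    return EFFORT_TABLE[max(lowest, min(effort, highest))]
```

The useful property is that no row can express "no tools". A future edit can make the model shallower, but it cannot make it deaf.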

Layer 3: plan_tools() had its own gate

Even after fixing EffortScaler, plan_tools() has:

_LOW_EFFORT_NO_TOOL_INTENTS = {"question"}
if intent_type in _LOW_EFFORT_NO_TOOL_INTENTS and effort_level <= 2:
    return ToolPlan(tools=[], tool_choice=None, max_tool_turns=0)

A second gate, independently blocking the same thing. Different file, different author, same assumption: questions at low effort don't need tools.

Fix: Removed. Only pure social intents (greeting, thanks, farewell) at effort 1-2 skip tools. Questions always get tools.
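A sketch of the narrowed gate, assuming a `ToolPlan` shaped like the one in the snippet above. The `candidates` parameter, the `"auto"` tool choice, and the turn budget are placeholders, not the real `plan_tools()` signature.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToolPlan:
    tools: List[str] = field(default_factory=list)
    tool_choice: Optional[str] = None
    max_tool_turns: int = 0


# Only pure social intents may skip tools, and only at low effort.
_SOCIAL_INTENTS = {"greeting", "thanks", "farewell"}


def plan_tools(intent_type: str, effort_level: int, candidates: List[str]) -> ToolPlan:
    """Hypothetical simplified planner: questions always keep their tools."""
    if intent_type in _SOCIAL_INTENTS and effort_level <= 2:
        return ToolPlan(tools=[], tool_choice=None, max_tool_turns=0)
    # Assumed defaults: let the LLM choose, with at least one tool turn.
    return ToolPlan(
        tools=list(candidates),
        tool_choice="auto",
        max_tool_turns=max(1, effort_level),
    )
```

Inverting the set matters here: an allowlist of intents that may skip tools fails open when a new intent type is added, whereas the old blocklist failed closed.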

Layer 4: The second-guesser

EffortScaler now sets reasoning_depth: "gate" at effort 3. Good — "gate" means the LLM decides whether to reason. Then DeepReasoningHelper runs:

DeepReasoningHelper override: gate → skip (confidence=0.90)

DeepReasoningHelper (DRH) is a second classifier. It scores messages for reasoning triggers: "step by step", "analyze", "root cause". "How do you know the time?" has none. Score: 0.105 out of 1.0. DRH says: trivial. Override to skip.

The effort table — the thing that knows about the always-agentic architecture we just built — said gate. DRH — a keyword matcher that doesn't know any of that — overrode it.

Fix: The DRH override was deleted from the EffortScaler. The effort table is the authority. No second-guesser.

Layer 5: MCTS was solving the wrong problem

With everything unblocked, the Monte Carlo Tree Search tool router should find the right tools. But MCTS was designed for sequential decision-making — chess, Go, multi-step planning. We were using it to pick an unordered set of 5 tools from a pool of 10.

That's 252 possible combinations. You can enumerate them all in under a millisecond. You don't need tree search.
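The arithmetic is easy to check, and brute force is short enough to write inline. This is a sketch with placeholder tool names and a toy scoring function, not the router's code.

```python
from itertools import combinations
from math import comb
from typing import Callable, Sequence, Tuple


def best_tool_set(
    pool: Sequence[str],
    k: int,
    score: Callable[[Tuple[str, ...]], float],
) -> Tuple[str, ...]:
    """Score every k-sized subset of the pool and return the best one."""
    return max(combinations(pool, k), key=score)


pool = [f"tool_{i}" for i in range(10)]  # placeholder names

# 10 choose 5: the whole search space fits in a short list.
assert sum(1 for _ in combinations(pool, 5)) == comb(10, 5) == 252


def toy_score(tools: Tuple[str, ...]) -> float:
    """Toy scorer for illustration: prefer lower-numbered tools."""
    return -sum(int(name.split("_")[1]) for name in tools)


print(best_tool_set(pool, 5, toy_score))
```

Exhaustive search stops being an option once order matters or the pool grows, which is where the chain-ordering change below comes in.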

Worse: the keyword-based prefilter reduced 10 tools to 2-3 before MCTS even ran. With fewer candidates than the target size, MCTS was skipped entirely on most requests. The Monte Carlo Tree Search engine that was supposed to find optimal tool combinations was a no-op.

And when it did run, it scored tools by keyword F1 overlap — literal word matching between the query and the tool description. reason (the most relevant tool for "how do you know the time?") scored 0.000 because the word "reason" doesn't appear in "how do you know the time."
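Here is a minimal sketch of token-overlap F1 that reproduces the zero. The tokenizer and the exact F1 formulation are assumptions (the router's real scorer isn't shown), and "Deep reasoning and analysis" is assumed to stand in for the `reason` tool's description.

```python
import re
from typing import Set


def tokens(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_f1(query: str, description: str) -> float:
    """F1 over literal word overlap between a query and a tool description."""
    q, d = tokens(query), tokens(description)
    overlap = len(q & d)
    if overlap == 0:
        return 0.0
    precision = overlap / len(d)  # share of description words found in the query
    recall = overlap / len(q)     # share of query words found in the description
    return 2 * precision * recall / (precision + recall)


# No shared words at all, so the most relevant tool scores zero.
print(keyword_f1("How do you know the time?", "Deep reasoning and analysis"))
```

Any scorer built on literal overlap has this failure mode: relevance that lives in meaning rather than vocabulary is invisible to it.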

Three changes:

Semantic scoring. Cosine similarity over 768-dim embeddings instead of keyword matching. "Deep reasoning and analysis" is semantically close to "technical explanation" even though they share no words.
Chain ordering. MCTS now scores tool sequences, not sets. A chain where search comes before reasoning scores higher than reasoning before search. GATHER → PROCESS → PRODUCE. This is what tree search is actually for.
Intent fit. Tools matching the expected categories for the classified intent get boosted. question → {reasoning, search, information}. MCTS can't put find_ui_element in a research chain anymore.
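The three changes compose into a single chain score. The sketch below is one way to combine them, under loud assumptions: the stage and category assigned to each tool, the equal 0.5 weights, and the function names are all invented for illustration, and `relevance` stands in for the cosine similarity between the query embedding and each tool's embedding.

```python
import math
from typing import Dict, Sequence

# Hypothetical metadata: stage 0 = GATHER, 1 = PROCESS, 2 = PRODUCE.
TOOLS = {
    "knowledge_search": {"stage": 0, "category": "search"},
    "web_search": {"stage": 0, "category": "search"},
    "reason": {"stage": 1, "category": "reasoning"},
    "find_ui_element": {"stage": 0, "category": "ui"},
}
INTENT_CATEGORIES = {"question": {"reasoning", "search", "information"}}


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; in practice the inputs would be 768-dim embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def order_score(chain: Sequence[str]) -> float:
    """Fraction of adjacent pairs that never step back to an earlier stage."""
    pairs = list(zip(chain, chain[1:]))
    if not pairs:
        return 1.0
    in_order = sum(TOOLS[a]["stage"] <= TOOLS[b]["stage"] for a, b in pairs)
    return in_order / len(pairs)


def intent_fit(chain: Sequence[str], intent: str) -> float:
    """Fraction of tools whose category is expected for the intent."""
    expected = INTENT_CATEGORIES.get(intent, set())
    return sum(TOOLS[t]["category"] in expected for t in chain) / len(chain)


def chain_score(chain: Sequence[str], intent: str, relevance: Dict[str, float]) -> float:
    """Mean semantic relevance plus ordering and intent-fit bonuses."""
    mean_relevance = sum(relevance[t] for t in chain) / len(chain)
    # The 0.5 weights are arbitrary placeholders.
    return mean_relevance + 0.5 * order_score(chain) + 0.5 * intent_fit(chain, intent)
```

With equal relevance, `("knowledge_search", "reason")` outscores `("reason", "knowledge_search")` purely on ordering, and swapping `web_search` for `find_ui_element` in a question chain costs the intent-fit bonus.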

Layer 6: The context pipeline was off

After all five fixes, the LLM has the right tools, the right intent, the right reasoning depth. It should answer correctly now.

It doesn't. It gives another philosophy lecture.

context_pipeline: False
context_chunks: None

The 12-stage context pipeline — the thing that injects codegraph, flux, axioms, identity, memory into the system prompt — has its own effort gate:

if effort <= 2:
    return await next(state)  # skip pipeline

"How do you know that?" gets effort 2. The context pipeline doesn't run. The LLM has no system knowledge. No codegraph, no axioms, no identity. It's raw model weights answering from pre-training. Of course it gives a generic answer — it literally doesn't know it's Aither.

Fix: The context pipeline runs at every effort level. The model needs to know who it is to answer questions about itself.
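In middleware terms, the fix amounts to deleting the early return. The sketch below assumes a simple `(state, next_handler)` middleware signature and a stubbed `load_context_chunks`; the real middleware and its 12 stages are not shown in this post, so every name here is hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

State = Dict[str, Any]
NextHandler = Callable[[State], Awaitable[State]]


async def load_context_chunks(state: State) -> List[str]:
    """Stand-in for the real pipeline; returns placeholder chunk labels."""
    return ["codegraph", "flux", "axioms", "identity", "memory"]


async def context_middleware(state: State, next_handler: NextHandler) -> State:
    """Load context unconditionally, then hand off to the next middleware."""
    # Deliberately no `if effort <= 2` check here: identity and system
    # knowledge are loaded at every effort level.
    state["context_chunks"] = await load_context_chunks(state)
    state["context_pipeline"] = True
    return await next_handler(state)


async def passthrough(state: State) -> State:
    """Terminal handler used only for this demonstration."""
    return state


result = asyncio.run(context_middleware({"effort": 1}, passthrough))
print(result["context_pipeline"], len(result["context_chunks"]))
```

If latency on trivial messages is still a concern, the dial-not-gate principle suggests trimming how much context is loaded at low effort rather than skipping the pipeline outright.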

The result

Before:

Intent: self_awareness          ← regex stole it
Effort: 4, mode: direct         ← no tools
Reasoning: skip                 ← DRH overrode
Context pipeline: skipped       ← effort gate
Tools: [self_awareness]         ← not real tools
MCTS: skipped                   ← pool too small

Answer: "NTP servers... atomic clocks... training data..."

After:

Intent: question                ← correct
Effort: 3, mode: tools          ← always-agentic
Reasoning: gate                 ← LLM decides
Context pipeline: 12 chunks     ← flux, codegraph, axioms
Tool chain: knowledge_search → web_search → reason
MCTS: chain-ordered             ← GATHER→PROCESS

Answer: "EcosystemClient.get_time_awareness(), part of the Pulse service."

The lesson isn't about any one fix

Every layer was individually reasonable when it was written:

The regex was correct for existential questions
The effort gate saved compute on trivial messages
DRH prevented unnecessary reasoning on greetings
MCTS keyword matching worked when tool descriptions were keyword-rich
The context pipeline skip saved 2 seconds on "hello"

The problem was six reasonable defaults compounding into total blindness. Each layer assumed the others would handle the edge cases. None did. A question that should have been trivial to answer — what function call gives you the time? — required traversing every layer of the intelligence stack and finding a bug in each one.

The meta-fix isn't in any individual layer. It's the principle: the LLM decides, not the pipeline. Give it tools, give it context, give it reasoning access. Let it choose what to use. Stop building gates that assume you know better than the model what any given message needs.

Effort is a dial that controls depth: how many turns, which model, how much reasoning, whether to invoke council or swarm or expedition. It's not a gate that controls existence: whether tools are present, whether context is loaded, whether the model knows its own name.

Simple questions deserve real answers. The system always had this one. It just needed to stop getting in its own way.

What changed

Layer	File	What
Intent regex	`intent_categories.yaml`	"do you know" only matches standalone
Effort gate	`EffortScaler.py`	Always `"tools"`, never `"direct"`
Tool gate	`conversational_tools.py`	Questions always get tools
DRH override	`EffortScaler.py`	Removed entirely
MCTS scoring	`MCTSRouter.py`	Semantic + chain ordering + intent fit
Context gate	`ThinkMiddlewares.py`	Pipeline runs at all effort levels
Trace/observability	`UnifiedChatBackend.py`, `ThinkMiddlewares.py`, `chat_engine.py`	tool_chain, context_layers in metadata

65 tests. Zero regressions. The previous blog post was about squeezing performance from hardware. This one is about getting out of your own way.
