When Your AI Can't Hear You
How a single misclassified intent broke an entire creative pipeline — and what fixing it revealed about cognitive architecture
I typed one sentence into AitherShell: "can you output a short video explaining the 6 pillars of aitheros."
Aither thought about it. It fired two tools — a knowledge search and a reasoning chain. Then it stopped. No video. No explanation of why there was no video. Just silence after the tool results streamed back. The reasoning chain was truncated at 600 characters. The stream ended with the LLM's internal tool-call markup still visible in the output.
The intent classification said: conversation | Effort: 3.
Conversation. Effort three. For a request to render a video.
This is the story of how I traced that failure through six files across three architectural layers, and what it taught me about building systems that actually understand what you're asking for.
The Three Bugs That Looked Like One
When something doesn't work in a 200-service system, the instinct is to blame the obvious suspect. The video renderer must be broken. The tool must not be registered. But the real failure was upstream — way upstream — in the moment the system decided what I was asking for.
Bug 1: The intent engine couldn't recognize creative requests in natural language.
AitherOS classifies every message through a multi-layer intent engine: regex pattern matching, semantic category mapping, and a lightweight NanoGPT scoring pass. The CREATION intent patterns looked like this:
```
^create\b, ^generate\b, ^make\b, ^build\b, ^scaffold\b, ^write\b
```
Every single pattern was anchored to the start of the line with ^. If your message started with "create a video," great. If it started with "can you output a short video" — which is how humans actually talk — nothing matched. The system fell through to conversation.
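The mismatch is easy to reproduce. A minimal sketch (the pattern list is from the article; the little harness around it is mine):

```python
import re

# The anchored CREATION patterns quoted above.
ANCHORED = [r"^create\b", r"^generate\b", r"^make\b",
            r"^build\b", r"^scaffold\b", r"^write\b"]

msg = "can you output a short video explaining the 6 pillars of aitheros"

# None of the anchored patterns fire on a politely phrased request,
# so classification falls through to conversation.
matched = any(re.search(p, msg, re.IGNORECASE) for p in ANCHORED)
print(matched)  # False

# The same verbs anchored to a word boundary instead of line start
# catch the verb even when it is buried mid-sentence.
relaxed = [p.replace("^", r"\b") for p in ANCHORED]
print(any(re.search(p, "please create a video", re.IGNORECASE) for p in relaxed))  # True
```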
Meanwhile, the semantic category map had "video": IntentType.COMMAND. Not CREATION. COMMAND. So even when the video category matched, it routed to the wrong intent type — one that doesn't inject creative tools.
Bug 2: The effort score was too low to unlock the tools needed.
AitherOS uses an effort scale from 1 to 10 that controls everything downstream: which model gets used, how many tool turns are allowed, how much context gets assembled, and which tools are even offered to the LLM. Effort 3 means "trivial conversation" — one tool turn max, minimal context, no creative tools injected.
Video rendering needs at least effort 5 to get the Remotion tools into the tool selection. The video category in the YAML config had effort_floor: 4. The CREATION intent type had no effort floor at all. So even in the best case, a video request was getting scored below the threshold needed to actually make a video.
Bug 3: When the follow-up LLM call failed, the system went silent.
After tools execute in a chat turn, the UnifiedChatBackend makes a follow-up LLM call with the tool results, asking the model to synthesize a human-readable response. If that call times out or errors — which happens when the model context is too thin because effort was too low — the exception handler kept the raw tool-call instruction text from the LLM's previous response. Not the tool results. Not a synthesis. The actual `<tool_call>` XML that was never meant to be shown to users.
Three bugs. Three layers. One silent failure.
Pattern Matching Is a Conversation About Expectations
The fix for Bug 1 was technically simple but philosophically interesting. I added non-anchored patterns:
```
\b(create|generate|make|produce|render|output)\s+(a\s+)?(short\s+)?(video|clip|animation|...)
\b(video|clip|animation|slideshow)\s+(about|explaining|showing|of|for)\b
```
These catch the way people actually phrase creative requests — with preambles, hedging, politeness markers. "Can you make a video about..." "I'd like you to output a short clip explaining..." "Could you render an animation of..."
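Roughly how the two new patterns behave on those phrasings (the trailing alternation elided in the article is filled with illustrative media types, not the real list):

```python
import re

# Non-anchored patterns modeled on the fix described above; the
# media-type alternation is illustrative, not the full production list.
verb_object = re.compile(
    r"\b(create|generate|make|produce|render|output)\s+"
    r"(a\s+)?(short\s+)?(video|clip|animation)", re.IGNORECASE)
object_topic = re.compile(
    r"\b(video|clip|animation|slideshow)\s+"
    r"(about|explaining|showing|of|for)\b", re.IGNORECASE)

requests = [
    "can you output a short video explaining the 6 pillars of aitheros",
    "I'd like you to output a short clip explaining the roadmap",
    "could you render an animation of the boot sequence",
]

# Every politely phrased request now trips at least one CREATION pattern.
for msg in requests:
    assert verb_object.search(msg) or object_topic.search(msg)
print("all matched")
```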
But here's the deeper issue: anchored patterns are a developer's model of language. Developers think in commands. create <thing>. run <command>. Start with the verb, follow with the object. That's how CLIs work. That's how APIs work.
Users think in requests. They wrap the action in social context. "Can you..." "I need..." "Would it be possible to..." The verb is buried in the middle of the sentence, surrounded by conversational scaffolding that the pattern engine was trained to ignore.
This isn't just a regex bug. It's an architectural assumption about who's talking. And in a system designed to be conversational, that assumption was exactly wrong.
Effort Is the Hidden Control Plane
Most people who look at AitherOS see the tools, the agents, the model routing. They don't see the effort system. But effort is arguably the most important decision the system makes, because it controls the shape of every downstream computation.
Here's what changes between effort 3 and effort 8:
| Dimension | Effort 3 | Effort 8 |
|---|---|---|
| Tool turns allowed | 1 | 3 |
| Context assembly | Minimal | Full pipeline |
| Model selection | Fast/small | Full orchestrator |
| Max tokens | ~1024 | ~4096 |
| Tools offered | Basic | Full creative suite |
| Neuron pipeline | Skipped | All 12 stages |
When the system scored my video request at effort 3, it wasn't just picking the wrong number. It was collapsing the entire execution envelope. The model never saw the Remotion tools because they weren't offered. The context pipeline was skipped, so the model didn't know AitherOS has a video rendering service. The follow-up LLM call got a minimal token budget, making it more likely to time out.
Effort is a force multiplier. Get it wrong and everything downstream is underpowered.
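The table reads naturally as a config lookup. A hypothetical sketch of that idea (the ExecutionEnvelope fields and tier values are my naming of the two columns above, not AitherOS source):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExecutionEnvelope:
    tool_turns: int
    max_tokens: int
    full_context: bool
    creative_tools: bool
    neuron_stages: int

# Illustrative effort tiers mirroring the effort-3 vs effort-8 columns;
# the real system presumably grades more finely between them.
ENVELOPES = {
    3: ExecutionEnvelope(tool_turns=1, max_tokens=1024,
                         full_context=False, creative_tools=False,
                         neuron_stages=0),
    8: ExecutionEnvelope(tool_turns=3, max_tokens=4096,
                         full_context=True, creative_tools=True,
                         neuron_stages=12),
}

print(ENVELOPES[3].tool_turns, ENVELOPES[8].tool_turns)  # 1 3
```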
The fix introduced three effort amplifiers for creative/media work:
- CREATION intent floor: Any CREATION-classified intent starts at effort ≥ 5
- YAML category floors: Video and generation categories set effort_floor: 5
- Complexity amplifiers: Keywords like "video" (+3), "animation" (+3), "render" (+2) boost effort from the base score
My original request now scores effort 8: base 3 (short message) → floor 5 (CREATION) → +3 (video keyword) = 8. That unlocks the full creative pipeline.
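That trace can be sketched as a small scoring function. The floor and keyword weights come from the article; the function shape, including taking the strongest single amplifier, is my assumption:

```python
# Effort floors and keyword amplifiers described above.
CREATION_FLOOR = 5
AMPLIFIERS = {"video": 3, "animation": 3, "render": 2}
MAX_EFFORT = 10

def score_effort(base: int, intent: str, message: str) -> int:
    # 1. Apply the intent floor: CREATION never starts below 5.
    effort = max(base, CREATION_FLOOR) if intent == "CREATION" else base
    # 2. Add the strongest matching keyword amplifier (assumed non-stacking).
    boosts = [w for kw, w in AMPLIFIERS.items() if kw in message.lower()]
    if boosts:
        effort += max(boosts)
    # 3. Clamp to the 1-10 scale.
    return min(effort, MAX_EFFORT)

msg = "can you output a short video explaining the 6 pillars of aitheros"
print(score_effort(3, "CREATION", msg))  # 3 -> floor 5 -> +3 (video) = 8
```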
Silent Failures Are the Hardest Kind
Bug 3 was the most insidious because it produced no error message, no crash, no degraded-but-visible output. The stream just... stopped.
Here's what was happening in the tool loop:
1. LLM generates response with `<tool_call>` markup
2. Backend parses and executes tools (knowledge_search, reason)
3. Tools succeed, return results
4. Backend makes follow-up LLM call with tool results
5. Follow-up call fails (timeout/context overflow)
6. Exception handler: keep original LLM content (the `<tool_call>` markup)
7. Stream the markup as the final response
8. AitherShell can't render `<tool_call>` tags → shows nothing
The fix was to check whether any tools had already executed successfully, and if so, synthesize a response from their output rather than falling back to the raw LLM text. It's not as good as a proper follow-up synthesis, but it's infinitely better than silence.
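A hedged sketch of that recovery path (names like finish_turn and the result-dict shape are illustrative, not the UnifiedChatBackend API):

```python
def follow_up_llm_call(tool_results):
    # Stand-in for the real follow-up synthesis call; here it simulates
    # the timeout that was silently swallowing the response.
    raise TimeoutError("follow-up synthesis timed out")

def finish_turn(llm_text: str, tool_results: list) -> str:
    """Produce the final streamed text after the tool loop."""
    try:
        return follow_up_llm_call(tool_results)
    except Exception:
        # Old behavior: return llm_text, i.e. the raw <tool_call> markup
        # the shell cannot render. New behavior: if any tool already
        # succeeded, surface its output instead of going silent.
        succeeded = [r for r in tool_results if r.get("ok")]
        if succeeded:
            return "\n\n".join(str(r["output"]) for r in succeeded)
        return "Tools ran, but no response could be synthesized."

results = [
    {"ok": True, "output": "knowledge_search: summary of the 6 pillars"},
    {"ok": True, "output": "reason: outline for the video script"},
]
print(finish_turn("<tool_call>...</tool_call>", results))
```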
This is a pattern I keep encountering in complex systems: the failure mode that produces less output instead of wrong output. Users can correct wrong output. They can't correct nothing. A system that fails visibly is more trustworthy than one that fails invisibly.
The Cognitive Stack
Looking at this from altitude, what we're really building is a cognitive stack — layers of interpretation between a human's natural language and a system's capabilities:
```
Human language: "can you output a short video explaining the 6 pillars"
        ↓
Intent layer: CREATION (not conversation)
        ↓
Effort layer: 8 (not 3)
        ↓
Tool selection: render_remotion_video, list_remotion_compositions
        ↓
Context assembly: Full pipeline — knowledge, architecture, tool schemas
        ↓
Generation: LLM plans + executes tool calls
        ↓
Recovery: Synthesize from tool results if follow-up fails
        ↓
Rendering: Video composition → user
```
Each layer is a decision point. Each decision constrains what's possible downstream. Get the intent wrong, and effort is wrong. Get effort wrong, and tools are missing. Get tools wrong, and the model can't act. Get the error handling wrong, and success is invisible.
This is why I keep saying that the hard problem in AI isn't generation — it's understanding. Not understanding language, but understanding intent. The LLM in this pipeline is perfectly capable of rendering a video. It has the tools, the knowledge, the reasoning capability. The failure was entirely in the system's ability to hear what was being asked.
What Changed
Eight files. Two new. Six modified. 913 lines of change across:
- IntentEngine.py — Non-anchored CREATION patterns, video→CREATION mapping, complexity amplifiers, effort floors
- intent_categories.yaml — Expanded video/generation categories with Remotion tools, raised effort floors
- intent_classifier.py — YAML categories override trivial heuristic fallback, consistent generic-intent logic
- UnifiedChatBackend.py — Tool loop exception recovery, system introspection gate
- ThinkMiddlewares.py — Auto-create notebooks for MCTS delegation chains
- MCTSPlanner.py — Configurable LLM evaluation backends
- reasoning_models.yaml (new) — Multi-profile reasoning model config (local, Claude, Gemini, OpenAI, DeepSeek, hybrid)
- ReasoningModelSelector.py (new) — Config-driven backend selector for MCTS planning
All 321 tests pass. The original prompt now correctly classifies as CREATION | Effort: 8 and gets the full creative pipeline with Remotion video tools.
The Lesson
When an AI system fails to do what you asked, the problem is almost never that it can't. The problem is that somewhere between your words and its capabilities, a classification boundary was drawn in the wrong place. A threshold was set too low. An error was swallowed instead of surfaced.
The systems we're building are capable enough. The question is whether they can hear you.
David Parkhurst is the architect of AitherOS — an autonomous agent operating system with 200+ microservices, where the hardest engineering problems aren't about generation, but about understanding what you meant in the first place.