One Endpoint, Two Execution Paths: How We Fixed Demo Chat Timeouts Without Forking the API
The error banner said timeout.
The containers were healthy.
Genesis was healthy. Veil was healthy. MicroScheduler was healthy. vLLM was healthy. The demo page still showed:
Request timed out — the backend took too long to respond. Try again.
So we did what every engineering team eventually has to do: stop trusting the label on the failure and trace the request end-to-end.
The result was one of those bugs that looks like infrastructure, feels like latency, and turns out to be routing policy.
The fix is not “increase the timeout again.” The fix is to keep one public endpoint while giving it two internal execution modes.
The Symptom
The public demo chat on the website is supposed to feel simple:
- ask Aither a question,
- get a fast, polished answer,
- look stable for visitors,
- never melt down into visible orchestration chaos.
But one of the canned questions — “What makes you different from ChatGPT?” — intermittently looked broken. The UI would sit there thinking, then eventually show a timeout banner.
That led to the obvious suspicion: maybe the backend really was timing out.
It wasn’t.
The First Wrong Theory: The Services Are Down
We checked the obvious suspects first:
- aitheros-veil
- aitheros-genesis
- aitheros-microscheduler
- aitheros-gateway
- aither-vllm
- aither-vllm-reasoning
They were up and healthy.
That immediately narrowed the problem. If the stack is healthy and the UI still shows a timeout, one of three things is true:
- the frontend is aborting too aggressively,
- the proxy route is returning the wrong error,
- or the backend is technically alive but taking a catastrophically wrong path.
The third option turned out to be the real one.
The Key Discovery: The Website Demo Was Not Using the Path We Thought It Was
There are effectively two execution styles in AitherVeil:
- a deep orchestrated path through /api/v2/chat
- a fast conversational path through /api/agent-chat
At a glance, those sound similar. They are not.
The website demo chat was going through the unified route in AitherOS/apps/AitherVeil/src/app/api/v2/chat/route.ts, not the simpler agent route.
That mattered because /api/v2/chat is built to support the full system:
- deeper orchestration,
- multi-turn internal reasoning,
- tool use,
- richer streaming telemetry,
- more agentic behavior.
That is exactly what you want for power users, internal surfaces, and serious agent work.
It is not what you want for a public demo where the main requirement is: ask a clean question, get a clean answer, fast.
What the “Timeout” Actually Was
When we replayed the real website request against /api/v2/chat, the route did not die immediately. It streamed. It stayed alive. It kept producing events.
That sounds good until you see what it was doing.
It was getting stuck in a long multi-turn loop:
- thinking,
- calling tool_search,
- getting mediocre tool results,
- thinking again,
- calling tool_search again,
- accumulating more and more context,
- repeating this across many turns.
Eventually the problem was no longer “slow.” The accumulated request was too large to fit.
The model backend finally threw a hard context-window error.
Conceptually, the failure looked like this:
user asks simple demo question
↓
/api/v2/chat takes deep orchestration path
↓
agent enters repeated tool / thinking loop
↓
context grows every turn
↓
vLLM rejects oversized request
↓
frontend shows generic timeout/error banner
So the UI message said “backend took too long,” but the root cause was closer to:
the demo chat got routed into an execution mode that was too deep, too tool-happy, and too unstable for a simple public-facing conversation.
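The loop’s failure mode is easy to reproduce in miniature. Here is a minimal sketch of the arithmetic; the token counts and window size are illustrative, not our real numbers:

```typescript
// Simulate an agent loop that appends its own thinking and tool
// results back into the prompt every turn. Context grows
// monotonically until it no longer fits the model's window.
function turnsUntilContextOverflow(
  basePromptTokens: number, // initial system + user prompt
  growthPerTurn: number,    // thinking + tool-result tokens added each turn
  contextWindow: number,    // model's hard limit
  maxTurns = 100,
): number {
  let context = basePromptTokens;
  for (let turn = 1; turn <= maxTurns; turn++) {
    context += growthPerTurn;
    if (context > contextWindow) return turn; // vLLM rejects here
  }
  return -1; // never overflowed within maxTurns
}

// Illustrative numbers: a 2k-token prompt growing 3k tokens per
// turn blows past a 32k window on turn 11.
const overflowTurn = turnsUntilContextOverflow(2_000, 3_000, 32_000);
```

The point is that nothing in this loop is “hung.” Every individual turn succeeds; the failure only appears once the sum crosses the window, which is why the stack looked healthy the whole time.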
The Two Paths
Here’s the practical difference.
Path A — Unified v2 Chat
This is the powerful path. It’s the general-purpose route.
Website Chat UI
│
└── POST /api/v2/chat
│
└── Aeon /v2/chat
│
├── starts SSE
├── planning / thinking turns
├── tool calls
├── deeper orchestration
└── final answer
When it works, it’s the most capable path.
When it misfires on a trivial public demo prompt, it can become a latency amplifier.
Path B — Fast Conversational Chat
This is the safer path for narrow conversational flows.
Website Chat UI
│
└── POST /api/agent-chat
│
├── immediate heartbeat
├── classify + select-llm
├── Genesis /chat
├── fallback Genesis /agent
├── fallback Aeon /chat
└── fallback trivial/offline response
This path is much better aligned with the website demo’s actual UX contract:
- stay alive visibly,
- answer quickly,
- avoid pathological loops,
- degrade gracefully.
And in direct repro, it answered the same demo prompt in about 16 seconds with a normal result.
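The fallback ladder in Path B can be sketched as an ordered list of attempts, each tried until one succeeds. The handler names and shape here are illustrative assumptions; the real route wires these rungs to Genesis and Aeon HTTP calls:

```typescript
// A fallback ladder: try each handler in order; first success wins.
// Handlers are injected so the ladder stays transport-agnostic.
type ChatHandler = (prompt: string) => Promise<string>;

async function answerWithFallbacks(
  prompt: string,
  ladder: ChatHandler[],
  offlineResponse: string,
): Promise<string> {
  for (const handler of ladder) {
    try {
      return await handler(prompt);
    } catch {
      // This rung failed; fall through to the next one.
    }
  }
  // Every rung failed: degrade gracefully instead of erroring.
  return offlineResponse;
}

// Usage sketch: Genesis /chat, then Genesis /agent, then Aeon /chat.
// const answer = await answerWithFallbacks(
//   prompt, [genesisChat, genesisAgent, aeonChat], "I'm offline right now.");
```

The graceful-degradation rung at the bottom is what keeps the demo from ever showing raw orchestration failure to a visitor.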
The Wrong Fix: Split the Frontend Across Two Public APIs
The fastest tactical fix is obvious:
- keep normal chat on /api/v2/chat,
- make the website demo call the fast conversational route at /api/agent-chat.
That works.
It is also the wrong long-term API shape.
Once you let the frontend pick different endpoint families for what is conceptually the same feature, you start leaking execution policy into the UI layer. Then every product surface grows its own chat contract, and now you are debugging not just models and routes, but which frontend happened to pick which backend flavor six weeks ago.
That’s how architecture rots.
The Better Fix: One Public Endpoint, Two Internal Execution Modes
The cleaner design is:
- keep one public endpoint: /api/v2/chat,
- add a small routing hint in the body, like chat_profile: "demo",
- let the route decide the internal strategy.
That preserves the external API surface while still giving us different execution behavior for different product contexts.
Conceptually:
Website Chat UI
│
└── POST /api/v2/chat { chat_profile: "demo" }
│
├── if demo profile:
│ internally proxy to the fast conversational path
│
└── otherwise:
use normal unified v2 path
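Inside the route handler, the dispatch is a small, pure decision on the request body. A sketch under assumed names (selectExecutionPath and the "fast"/"deep" labels are illustrative, not the actual code in route.ts):

```typescript
// The public contract stays one endpoint; the body carries intent,
// and the server maps that intent to an internal execution path.
type ChatProfile = "demo" | "default";

interface ChatRequestBody {
  messages: { role: string; content: string }[];
  chat_profile?: ChatProfile;
}

function selectExecutionPath(body: ChatRequestBody): "fast" | "deep" {
  // Demo conversations take the fast conversational path
  // (internally the /api/agent-chat flow); everything else keeps
  // the full orchestrated Aeon /v2/chat path.
  return body.chat_profile === "demo" ? "fast" : "deep";
}
```

In the real handler this decision picks which upstream to proxy to. The point is that it lives server-side: the frontend only ever declares chat_profile and never learns the topology behind it.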
This is the part that matters architecturally:
The frontend expresses intent, not routing topology.
The UI should say “this is a demo conversation.”
The backend should decide what that means operationally.
Why a Demo Profile Is the Right Abstraction
A demo profile is not a hack. It is an explicit statement that different conversation contexts need different orchestration budgets.
A public website chat has different goals than:
- a coding surface,
- an internal operator dashboard,
- a long-running research session,
- or a multi-agent workspace.
For the demo profile, we want something like this:
- lower behavioral variance,
- minimal tool usage,
- fast first useful answer,
- strong fallback behavior,
- reduced chance of deep reasoning loops,
- and very low probability of context blow-up.
For power-user chat, we want the opposite:
- deeper orchestration,
- richer tools,
- more thinking,
- more exploration,
- and willingness to spend latency budget for better output.
Those are not separate products. They are separate policies behind the same product capability.
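One way to make those policies concrete is a per-profile orchestration budget that the route consults before running a single turn. The field names and numbers below are illustrative assumptions, not our production values:

```typescript
// Each profile is a policy: how much latency, tooling, and looping
// we are willing to spend for that conversation context.
interface OrchestrationBudget {
  maxTurns: number;         // hard cap on think/tool loop iterations
  allowTools: boolean;      // whether tool calls are permitted at all
  maxContextTokens: number; // reject before the model backend does
  latencyBudgetMs: number;  // soft deadline for first useful answer
}

const BUDGETS: Record<string, OrchestrationBudget> = {
  demo: {
    maxTurns: 2,
    allowTools: false,
    maxContextTokens: 8_000,
    latencyBudgetMs: 20_000,
  },
  default: {
    maxTurns: 12,
    allowTools: true,
    maxContextTokens: 100_000,
    latencyBudgetMs: 120_000,
  },
};

// Unknown or missing profiles keep the normal deep-path budget,
// matching the "otherwise: use normal unified v2 path" rule.
function budgetFor(profile: string): OrchestrationBudget {
  return BUDGETS[profile] ?? BUDGETS["default"];
}
```

A table like this also makes the trade-off reviewable: adding a new surface means adding one entry, not one endpoint.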
The Real Lesson
The most important part of this bug wasn’t the timeout.
It was the reminder that “one endpoint” and “one behavior” are not the same thing.
A clean system can expose one stable interface while still making careful choices internally about:
- latency budget,
- fallback strategy,
- tool allowance,
- reasoning depth,
- and loop tolerance.
That is what profiles are for.
Without that distinction, you end up with one of two bad outcomes:
- everything is shallow and your serious users get underpowered behavior, or
- everything is deep and your demo chat occasionally turns into an accidental stress test.
We were hitting the second case.
The Diagram We Ended Up Using Internally
This is the shortest way to explain the fix to ourselves:
ONE PUBLIC CONTRACT
POST /api/v2/chat
│
┌──────────────┴──────────────┐
│ │
│ chat_profile = "demo" │ default / deep
│ │
▼ ▼
fast conversational path deep orchestrated path
(internally /api/agent-chat) (Aeon /v2/chat)
│ │
├── immediate heartbeats ├── deeper reasoning
├── simpler answer path ├── tool use
├── strong fallbacks ├── richer telemetry
└── stable demo UX └── higher capability ceiling
That is the design in one picture.
What This Avoids
This approach avoids three common failure modes at once.
1. Frontend Routing Sprawl
The website doesn’t need to know whether the request should hit Genesis /chat, Genesis /agent, Aeon /v2/chat, or some future route.
It just declares the chat profile.
2. Capability Regression
We do not have to dumb down /api/v2/chat for every caller just because the public demo needs a safer lane.
The deep path stays available for the places that actually benefit from it.
3. Timeout Theater
A lot of systems “fix” these bugs by stretching timeouts until the UX fails more slowly.
That doesn’t solve the problem.
If the route is making the wrong execution decision, a bigger timeout just gives it more time to be wrong.
The Broader Pattern
This is not just a chat fix. It’s an architectural pattern we’ll keep reusing:
- one stable public contract
- context-specific execution profiles
- policy decided server-side
- fallback ladders matched to product intent
That pattern scales much better than proliferating endpoints every time two surfaces need slightly different behavior.
Today it’s demo chat.
Tomorrow it might be:
- chat_profile: "developer"
- chat_profile: "research"
- chat_profile: "operator"
- chat_profile: "public_demo"
Same API. Different orchestration policy.
That’s the right level of abstraction.
Conclusion
The public chat wasn’t broken because the stack was down.
It was broken because a simple public question was being routed through a path designed for deeper, more exploratory agent behavior. The system was alive the whole time. It was just being too clever in the wrong place.
So the answer is not “add more timeout.”
The answer is:
- keep one endpoint,
- add an explicit demo profile,
- and let the backend choose the fast path internally.
One public API. Two execution paths. Much less chaos.
This post is part of the AitherOS engineering series on routing, context, and production AI systems. Related reading: The Essay Principle: Why Your AI's Context Window Is an Encyclopedia When It Should Be a Briefing, Your AI Guesses Which Tools to Use. Ours Plays Chess., and AI That Runs What It Writes: Wiring Live Docker Containers Into the Agent Pipeline.