One Endpoint, Two Execution Paths: How We Fixed Demo Chat Timeouts Without Forking the API
The error banner said timeout.
The containers were healthy.
Genesis was healthy. Veil was healthy. MicroScheduler was healthy. vLLM was healthy. The demo page still showed:
Request timed out — the backend took too long to respond. Try again.
So we did what every engineering team eventually has to do: stop trusting the label on the failure and trace the request end-to-end.
The result was one of those bugs that looks like infrastructure, feels like latency, and turns out to be routing policy.
The fix is not “increase the timeout again.” The fix is to keep one public endpoint while giving it two internal execution modes.
The Symptom
The public demo chat on the website is supposed to feel simple:
- ask Aither a question,
- get a fast, polished answer,
- look stable for visitors,
- never melt down into visible orchestration chaos.
But one of the canned questions — “What makes you different from ChatGPT?” — intermittently looked broken. The UI would sit there thinking, then eventually show a timeout banner.
That led to the obvious suspicion: maybe the backend really was timing out.
It wasn’t.
The First Wrong Theory: The Services Are Down
We checked the obvious suspects first:
- aitheros-veil
- aitheros-genesis
- aitheros-microscheduler
- aitheros-gateway
- aither-vllm
- aither-vllm-reasoning
They were up and healthy.
That immediately narrowed the problem. If the stack is healthy and the UI still shows a timeout, one of three things is true:
- the frontend is aborting too aggressively,
- the proxy route is returning the wrong error,
- or the backend is technically alive but taking a catastrophically wrong path.
The third option turned out to be the real one.
The Key Discovery: The Website Demo Was Not Using the Path We Thought It Was
There are effectively two execution styles in AitherVeil:
- a deep orchestrated path through /api/v2/chat
- a fast conversational path through /api/agent-chat
At a glance, those sound similar. They are not.
The website demo chat was going through the unified route in AitherOS/apps/AitherVeil/src/app/api/v2/chat/route.ts, not the simpler agent route.
That mattered because /api/v2/chat is built to support the full system:
- deeper orchestration,
- multi-turn internal reasoning,
- tool use,
- richer streaming telemetry,
- more agentic behavior.
That is exactly what you want for power users, internal surfaces, and serious agent work.
It is not what you want for a public demo where the main requirement is: ask a clean question, get a clean answer, fast.
What the “Timeout” Actually Was
When we replayed the real website request against /api/v2/chat, the route did not die immediately. It streamed. It stayed alive. It kept producing events.
That sounds good until you see what it was doing.
It was getting stuck in a long multi-turn loop:
- thinking,
- calling tool_search,
- getting mediocre tool results,
- thinking again,
- calling tool_search again,
- accumulating more and more context,
- repeating this across many turns.
Eventually the problem was no longer “slow.” The accumulated request was too large to fit.
The model backend finally threw a hard context-window error.
Conceptually, the failure looked like this:
user asks simple demo question
↓
/api/v2/chat takes deep orchestration path
↓
agent enters repeated tool / thinking loop
↓
context grows every turn
↓
vLLM rejects oversized request
↓
frontend shows generic timeout/error banner
So the UI message said “backend took too long,” but the root cause was closer to:
the demo chat got routed into an execution mode that was too deep, too tool-happy, and too unstable for a simple public-facing conversation.
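The loop’s failure mode is easy to reproduce in miniature. Here is a minimal sketch of the arithmetic; the token counts and window size are illustrative, not our real numbers:

```typescript
// Simulate an agent loop that appends its own thinking and tool
// results back into the prompt every turn. Context grows
// monotonically until it no longer fits the model's window.
function turnsUntilContextOverflow(
  basePromptTokens: number, // initial system + user prompt
  growthPerTurn: number,    // thinking + tool-result tokens added each turn
  contextWindow: number,    // model's hard limit
  maxTurns = 100,
): number {
  let context = basePromptTokens;
  for (let turn = 1; turn <= maxTurns; turn++) {
    context += growthPerTurn;
    if (context > contextWindow) return turn; // vLLM rejects here
  }
  return -1; // never overflowed within maxTurns
}

// Illustrative numbers: a 2k-token prompt growing 3k tokens per
// turn blows past a 32k window on turn 11.
const overflowTurn = turnsUntilContextOverflow(2_000, 3_000, 32_000);
```

The point is that nothing in this loop is “hung.” Every individual turn succeeds; the failure only appears once the sum crosses the window, which is why the stack looked healthy the whole time.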
The Two Paths
Here’s the practical difference.
Path A — Unified v2 Chat
This is the powerful path. It’s the general-purpose route.
Website Chat UI
│
└── POST /api/v2/chat
│
└── Aeon /v2/chat
│
├── starts SSE
├── planning / thinking turns
├── tool calls
├── deeper orchestration
└── final answer
When it works, it’s the most capable path.
When it misfires on a trivial public demo prompt, it can become a latency amplifier.
Path B — Fast Conversational Chat
This is the safer path for narrow conversational flows.
Website Chat UI
│
└── POST /api/agent-chat
│
├── immediate heartbeat
├── classify + select-llm
├── Genesis /chat
├── fallback Genesis /agent
├── fallback Aeon /chat
└── fallback trivial/offline response
This path is much better aligned with the website demo’s actual UX contract:
- stay alive visibly,
- answer quickly,
- avoid pathological loops,
- degrade gracefully.
And in direct repro, it answered the same demo prompt in about 16 seconds with a normal result.
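The fallback ladder in Path B can be sketched as an ordered list of attempts, each tried until one succeeds. The handler names and shape here are illustrative assumptions; the real route wires these rungs to Genesis and Aeon HTTP calls:

```typescript
// A fallback ladder: try each handler in order; first success wins.
// Handlers are injected so the ladder stays transport-agnostic.
type ChatHandler = (prompt: string) => Promise<string>;

async function answerWithFallbacks(
  prompt: string,
  ladder: ChatHandler[],
  offlineResponse: string,
): Promise<string> {
  for (const handler of ladder) {
    try {
      return await handler(prompt);
    } catch {
      // This rung failed; fall through to the next one.
    }
  }
  // Every rung failed: degrade gracefully instead of erroring.
  return offlineResponse;
}

// Usage sketch: Genesis /chat, then Genesis /agent, then Aeon /chat.
// const answer = await answerWithFallbacks(
//   prompt, [genesisChat, genesisAgent, aeonChat], "I'm offline right now.");
```

The graceful-degradation rung at the bottom is what keeps the demo from ever showing raw orchestration failure to a visitor.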
The Wrong Fix: Split the Frontend Across Two Public APIs
The fastest tactical fix is obvious:
- keep normal chat on /api/v2/chat,
- make the website demo call the fast conversational route at /api/agent-chat.
That works.
It is also the wrong long-term API shape.
Once you let the frontend pick different endpoint families for what is conceptually the same feature, you start leaking execution policy into the UI layer. Then every product surface grows its own chat contract, and now you are debugging not just models and routes, but which frontend happened to pick which backend flavor six weeks ago.
That’s how architecture rots.
The Better Fix: One Public Endpoint, Two Internal Execution Modes
The cleaner design is:
- keep one public endpoint: /api/v2/chat,
- add a small routing hint in the body, like chat_profile: "demo",
- let the route decide the internal strategy.
That preserves the external API surface while still giving us different execution behavior for different product contexts.
Conceptually:
Website Chat UI
│
└── POST /api/v2/chat { chat_profile: "demo" }
│
├── if demo profile:
│ internally proxy to the fast conversational path
│
└── otherwise:
use normal unified v2 path
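Inside the route handler, the dispatch is a small, pure decision on the request body. A sketch under assumed names (selectExecutionPath and the "fast"/"deep" labels are illustrative, not the actual code in route.ts):

```typescript
// The public contract stays one endpoint; the body carries intent,
// and the server maps that intent to an internal execution path.
type ChatProfile = "demo" | "default";

interface ChatRequestBody {
  messages: { role: string; content: string }[];
  chat_profile?: ChatProfile;
}

function selectExecutionPath(body: ChatRequestBody): "fast" | "deep" {
  // Demo conversations take the fast conversational path
  // (internally the /api/agent-chat flow); everything else keeps
  // the full orchestrated Aeon /v2/chat path.
  return body.chat_profile === "demo" ? "fast" : "deep";
}
```

In the real handler this decision picks which upstream to proxy to. The point is that it lives server-side: the frontend only ever declares chat_profile and never learns the topology behind it.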
This is the part that matters architecturally:
The frontend expresses intent, not routing topology.
The UI should say “this is a demo conversation.”
The backend should decide what that means operationally.
Why a Demo Profile Is the Right Abstraction
A demo profile is not a hack. It is an explicit statement that different conversation contexts need different orchestration budgets.
A public website chat has different goals than:
- a coding surface,
- an internal operator dashboard,
- a long-running research session,
- or a multi-agent workspace.
For the demo profile, we want something like this:
- lower behavioral variance,
- minimal tool usage,
- fast first useful answer,
- strong fallback behavior,
- reduced chance of deep reasoning loops,
- and very low probability of context blow-up.
For power-user chat, we want the opposite:
- deeper orchestration,
- richer tools,
- more thinking,
- more exploration,
- and willingness to spend latency budget for better output.
Those are not separate products. They are separate policies behind the same product capability.
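One way to make those policies concrete is a per-profile orchestration budget that the route consults before running a single turn. The field names and numbers below are illustrative assumptions, not our production values:

```typescript
// Each profile is a policy: how much latency, tooling, and looping
// we are willing to spend for that conversation context.
interface OrchestrationBudget {
  maxTurns: number;         // hard cap on think/tool loop iterations
  allowTools: boolean;      // whether tool calls are permitted at all
  maxContextTokens: number; // reject before the model backend does
  latencyBudgetMs: number;  // soft deadline for first useful answer
}

const BUDGETS: Record<string, OrchestrationBudget> = {
  demo: {
    maxTurns: 2,
    allowTools: false,
    maxContextTokens: 8_000,
    latencyBudgetMs: 20_000,
  },
  default: {
    maxTurns: 12,
    allowTools: true,
    maxContextTokens: 100_000,
    latencyBudgetMs: 120_000,
  },
};

// Unknown or missing profiles keep the normal deep-path budget,
// matching the "otherwise: use normal unified v2 path" rule.
function budgetFor(profile: string): OrchestrationBudget {
  return BUDGETS[profile] ?? BUDGETS["default"];
}
```

A table like this also makes the trade-off reviewable: adding a new surface means adding one entry, not one endpoint.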
The Real Lesson
The most important part of this bug wasn’t the timeout.
It was the reminder that “one endpoint” and “one behavior” are not the same thing.
A clean system can expose one stable interface while still making careful choices internally about:
- latency budget,
- fallback strategy,
- tool allowance,
- reasoning depth,
- and loop tolerance.
That is what profiles are for.
Without that distinction, you end up with one of two bad outcomes:
- everything is shallow and your serious users get underpowered behavior, or
- everything is deep and your demo chat occasionally turns into an accidental stress test.
We were hitting the second case.
The Diagram We Ended Up Using Internally
This is the shortest way to explain the fix to ourselves:
ONE PUBLIC CONTRACT
POST /api/v2/chat
│
┌──────────────┴──────────────┐
│ │
│ chat_profile = "demo" │ default / deep
│ │
▼ ▼
fast conversational path deep orchestrated path
(internally /api/agent-chat) (Aeon /v2/chat)
│ │
├── immediate heartbeats ├── deeper reasoning
├── simpler answer path ├── tool use
├── strong fallbacks ├── richer telemetry
└── stable demo UX └── higher capability ceiling
That is the design in one picture.
What This Avoids
This approach avoids three common failure modes at once.
1. Frontend Routing Sprawl
The website doesn’t need to know whether the request should hit Genesis /chat, Genesis /agent, Aeon /v2/chat, or some future route.
It just declares the chat profile.
2. Capability Regression
We do not have to dumb down /api/v2/chat for every caller just because the public demo needs a safer lane.
The deep path stays available for the places that actually benefit from it.
3. Timeout Theater
A lot of systems “fix” these bugs by stretching timeouts until the UX fails more slowly.
That doesn’t solve the problem.
If the route is making the wrong execution decision, a bigger timeout just gives it more time to be wrong.
The Broader Pattern
This is not just a chat fix. It’s an architectural pattern we’ll keep reusing:
- one stable public contract
- context-specific execution profiles
- policy decided server-side
- fallback ladders matched to product intent
That pattern scales much better than proliferating endpoints every time two surfaces need slightly different behavior.
Today it’s demo chat.
Tomorrow it might be:
- chat_profile: "developer"
- chat_profile: "research"
- chat_profile: "operator"
- chat_profile: "public_demo"
Same API. Different orchestration policy.
That’s the right level of abstraction.
Conclusion
The public chat wasn’t broken because the stack was down.
It was broken because a simple public question was being routed through a path designed for deeper, more exploratory agent behavior. The system was alive the whole time. It was just being too clever in the wrong place.
So the answer is not “add more timeout.”
The answer is:
- keep one endpoint,
- add an explicit demo profile,
- and let the backend choose the fast path internally.
One public API. Two execution paths. Much less chaos.
This post is part of the AitherOS engineering series on routing, context, and production AI systems. Related reading: The Essay Principle: Why Your AI's Context Window Is an Encyclopedia When It Should Be a Briefing, Your AI Guesses Which Tools to Use. Ours Plays Chess., and AI That Runs What It Writes: Wiring Live Docker Containers Into the Agent Pipeline.