103 Seconds to Answer 'How Many R's in Strawberry?' — Fixing the Genesis Lockup
We ran a benchmark. Fifty trick questions — the kind designed to make LLMs think before answering. "How many r's in strawberry?" "Is it legal for a man to marry his widow's sister?" "What weighs more, a pound of feathers or a pound of bricks?"
The first question took 103 seconds to get a response. During those 103 seconds, curl https://localhost:8001/health returned nothing. Connection timeout. Genesis — the brain of the entire system — was completely unresponsive. One user asking one question shut down the API for nearly two minutes.
The fifty-question benchmark? It would have taken 85 minutes. With zero concurrent request capacity the entire time.
We fixed it. The same question now takes 19 seconds. Health checks respond in 0.2 seconds during processing. Eight workers handle concurrent requests. Here's exactly what was wrong and how we fixed it.
The 97-Second Mystery
The actual LLM inference — vLLM generating a response via MicroScheduler — took 5.9 seconds. That's the part that actually requires compute. The other 97 seconds were pure waste.
Genesis processes chat through a 30-middleware pipeline. This is by design: the DeerFlow pattern chains pre-processing (context assembly, memory recall, safety checks), generation (LLM call), and post-processing (memory storage, reflection, logging) as sequential async middleware. Each middleware gets a state object, does its work, and passes to next(state).
The pipeline is async. In theory, a single uvicorn event loop should handle many concurrent requests — while one request awaits an HTTP call, others run. In practice, two middlewares were making sequential HTTP calls to downstream services that were DOWN, each with 30-second default timeouts. The await was non-blocking at the event loop level, but the request was blocked for 100+ seconds before returning.
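To make the stacking effect concrete, here is a minimal sketch of sequential awaits against services that never answer. The service names are only labels, `call_down_service` and `handle_request` are hypothetical helpers rather than Genesis code, and 0.05 seconds stands in for the 30-second default timeout.

```python
import asyncio
import time


async def call_down_service(name: str, timeout: float) -> str:
    """Simulate an HTTP call to a service that never answers (hypothetical stand-in)."""
    try:
        # asyncio.sleep yields to the event loop, as awaiting a real socket would.
        await asyncio.wait_for(asyncio.sleep(3600), timeout=timeout)
        return f"{name}: ok"
    except asyncio.TimeoutError:
        return f"{name}: timed out"


async def handle_request(timeout: float) -> tuple[list[str], float]:
    """Await three dead services one after another and time the whole request."""
    started = time.monotonic()
    results = []
    for name in ("spirit", "nexus", "spirit-teach"):
        results.append(await call_down_service(name, timeout))
    return results, time.monotonic() - started


# Scaled down: 0.05s stands in for the 30-second default timeout.
results, elapsed = asyncio.run(handle_request(timeout=0.05))
print(results, f"{elapsed:.2f}s")
```

Each await gives the event loop back to other coroutines, yet this one request still pays the sum of all three timeouts before it can return.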
ConversationMemoryMiddleware: 77 seconds
The worst offender. This middleware does two things:
- Pre-generation recall — asks Spirit (port 8105) for similar past conversations to inject as context
- Post-generation storage — stores the exchange through UnifiedKnowledgeLayer, which calls Spirit, Mind, Nexus, and the embedding service sequentially
The recall hit Spirit over HTTP. Spirit was down. 30-second timeout. Then the response came back, the LLM generated in 5.9 seconds, and the storage phase kicked in: ingest_knowledge() tried Spirit (30s timeout), then Nexus (30s timeout), then the fallback Spirit teach() call (30s timeout).
That's 77 seconds of waiting for services to not respond, most of it after the user's answer was already generated.
ContextPipelineMiddleware: 20 seconds
The 12-stage context assembly pipeline runs even for effort=1 trivial questions. For "how many r's in strawberry?", it was assembling full neuron context, memory recalls, and Flux state — all hitting services that returned slowly or not at all.
The Fixes
Five changes, each targeting a specific bottleneck. No architectural rewrites. No middleware removal. Just timeouts and smart gating.
1. Gunicorn with 8 Workers
The headline change. Genesis was running a single uvicorn worker. One worker means one process — if that process is busy with a chat request, health checks queue behind it.
We tried uvicorn's built-in workers=4 parameter first. It requires passing the app as an import string instead of an object, and uvicorn creates its child processes with Python's multiprocessing "spawn" start method. Every spawned child has to re-import the entire 6,400-line genesis_service.py module, and Genesis has enough module-level state that the children kept crashing silently.
The fix was gunicorn. The standard production deployment pattern for FastAPI:
# gunicorn.conf.py
import os
worker_class = "uvicorn.workers.UvicornWorker"
# os.cpu_count() returns None when the count is unknown, hence the fallback
_cpu_count = os.cpu_count() or 4
_default_workers = min(_cpu_count, 8)
workers = int(os.environ.get("GENESIS_WORKERS", _default_workers))
timeout = 120  # Must be > longest request (90s chat timeout)
graceful_timeout = 30
keepalive = 120  # gunicorn's setting name is "keepalive" (no underscore)
bind = f"0.0.0.0:{os.environ.get('GENESIS_PORT', '8001')}"
Gunicorn manages worker processes properly: it handles the fork lifecycle, signal propagation, and worker restarts. On our Ryzen 9 9950X3D (24 cores visible to Docker), the default resolves to 8 workers, which is the cap set in the config. Override with GENESIS_WORKERS=N in docker-compose.
Eight workers means eight concurrent requests before any queuing. And since each worker runs an async event loop, each can handle many concurrent async operations internally. The practical limit is now the LLM backend, not Genesis.
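The worker-count rule can be checked in isolation. This is a sketch that mirrors the config's logic in a hypothetical `resolve_workers` helper (not part of Genesis), taking the environment and CPU count as arguments so it can be exercised without a real container.

```python
def resolve_workers(env: dict, cpu_count, cap: int = 8) -> int:
    """Mirror the config rule: min(cpus, cap) unless GENESIS_WORKERS overrides it."""
    # cpu_count may be None when the platform cannot report it; fall back to 4.
    default = min(cpu_count or 4, cap)
    return int(env.get("GENESIS_WORKERS", default))


print(resolve_workers({}, 24))                         # capped default
print(resolve_workers({}, 2))                          # small machine
print(resolve_workers({"GENESIS_WORKERS": "16"}, 24))  # explicit override
```

Note that the override is not capped, so GENESIS_WORKERS can exceed 8 when the LLM backend can keep up.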
2. 90-Second Timeout on engine.process()
The /chat endpoint previously had no timeout:
# Before
result = await engine.process(chat_request)
If the middleware chain hung — a misconfigured service, an infinite retry loop, a deadlocked HTTP client — the request would block forever, consuming a worker permanently.
# After
result = await asyncio.wait_for(engine.process(chat_request), timeout=90.0)
Now: 90-second hard cap. Returns a 504 with a clear error message instead of timing out at the TCP level. The client gets a proper response, the worker is freed, and the error is logged.
Why 90 seconds and not 30? Agentic multi-turn with tool calls can legitimately take 60+ seconds. The benchmark uses a 180-second client timeout, so a 90-second server timeout means the client always gets a proper 504 instead of a connection timeout.
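The one-line change only raises a timeout error; something still has to turn that into a 504. Here is a minimal, framework-free sketch of that mapping. `process_with_deadline` and the two fake engines are hypothetical names, and the (status, body) tuple stands in for whatever response object the real endpoint returns.

```python
import asyncio


async def process_with_deadline(coro, timeout: float = 90.0):
    """Return (status_code, body), mapping a timeout to 504 instead of hanging."""
    try:
        result = await asyncio.wait_for(coro, timeout=timeout)
        return 200, {"response": result}
    except asyncio.TimeoutError:
        # wait_for cancels the inner coroutine before raising, so the worker is freed.
        return 504, {"error": f"chat processing exceeded {timeout:g}s"}


async def slow_engine():
    await asyncio.sleep(3600)  # stands in for a hung middleware chain
    return "never"


async def fast_engine():
    return "three"


# Scaled down: 0.05s stands in for the 90-second cap.
status_slow, body_slow = asyncio.run(process_with_deadline(slow_engine(), timeout=0.05))
status_fast, body_fast = asyncio.run(process_with_deadline(fast_engine(), timeout=0.05))
print(status_slow, body_slow)
print(status_fast, body_fast)
```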
3. Fire-and-Forget Memory Storage
This was the biggest single improvement: 70 seconds saved per request.
The post-generation store_exchange() call was blocking the response. The user already had their answer — the LLM had generated it, the sanitization middleware had cleaned it, the response was ready to send. But the middleware chain is sequential, so the response sat there while store_exchange() tried to write to three backends that weren't responding.
# Before: awaited inline, so it blocks response delivery
await mem.store_exchange(
    user_message=user_message[:1000],
    agent_response=result.response[:1000],
)
# After: fire-and-forget, so the response returns immediately
asyncio.create_task(
    _safe_store_exchange(mem, user_message[:1000], result.response[:1000])
)
The _safe_store_exchange wrapper has its own 15-second timeout and catches all exceptions. If storage fails, it logs a warning and moves on. The user never knows, and the response is never delayed.
This is the right tradeoff. Conversation memory is valuable but optional. A failed store means one exchange isn't recalled in future conversations — not a system failure. A 70-second delay on every response is a system failure.
4. Timeout on Memory Recall
The pre-generation recall also blocked, though for less time: typically 5-30 seconds depending on Spirit's response time.
# Before
ctx = await mem.get_context_for_response(user_message)
# After
ctx = await asyncio.wait_for(
    mem.get_context_for_response(user_message),
    timeout=5.0,
)
Five-second timeout on an optional enrichment. If Spirit can't respond in 5 seconds, it's DOWN or overloaded — skip it. Other middlewares in the ContextPipeline already use 4-5 second timeouts for their sources; ConversationMemory was the outlier.
We also added matching timeouts inside ConversationMemory.py itself:
- ingest_knowledge(): 10-second timeout
- Spirit teach() fallback: 10-second timeout
- Spirit recall(): 5-second timeout
Each with proper asyncio.TimeoutError handlers that log warnings and return gracefully.
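A handler of that shape might look like the following sketch. The `recall_or_skip` helper, the empty-string fallback, and the two stub classes are assumptions for illustration. The source only says the handlers log a warning and return gracefully.

```python
import asyncio
import logging

logger = logging.getLogger("conversation_memory")


async def recall_or_skip(mem, user_message: str, timeout: float = 5.0) -> str:
    """Fetch optional recall context; fall back to an empty context on timeout."""
    try:
        return await asyncio.wait_for(
            mem.get_context_for_response(user_message), timeout=timeout
        )
    except asyncio.TimeoutError:
        logger.warning("memory recall exceeded %.1fs, continuing without it", timeout)
        return ""  # assumed fallback value: no extra context


class DownSpirit:
    """Hypothetical stub: the recall service never answers."""
    async def get_context_for_response(self, user_message):
        await asyncio.sleep(3600)


class HealthySpirit:
    """Hypothetical stub: the recall service answers immediately."""
    async def get_context_for_response(self, user_message):
        return "previously: user asked about spelling"


# Scaled down: 0.05s stands in for the 5-second recall timeout.
ctx_down = asyncio.run(recall_or_skip(DownSpirit(), "how many r's?", timeout=0.05))
ctx_up = asyncio.run(recall_or_skip(HealthySpirit(), "how many r's?", timeout=0.05))
print(repr(ctx_down), repr(ctx_up))
```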
5. Effort Gating
The simplest optimization: don't do expensive work for cheap questions.
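async def __call__(self, state, next):
    # Effort may live on the state object or in its metadata dict
    effort = getattr(state, "effort_level", None) or (
        state.metadata.get("effort_level") if hasattr(state, "metadata") else None
    )
    # Trivial messages (effort 1-2) skip recall and storage entirely
    if effort is not None and effort <= 2:
        return await next(state)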
Effort 1-2 covers "hi", "thanks", "what time is it?", and other trivial messages. These get classified by the IntentEngine in under 1ms. They don't need conversation memory recall (5-30 seconds of latency) or storage (10-70 seconds of latency). The middleware short-circuits directly to the next in chain.
This is part of a broader pattern in the PillarWiring layer: different effort levels get different execution plans. Effort 1-2 skips planning, RLM, and delegation middlewares. Effort 3-6 gets the full pipeline. Effort 7-10 adds MCTS pre-planning and reasoning model dispatch. The ConversationMemoryMiddleware just wasn't wired into this effort gating — now it is.
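The gate can be exercised end to end with a toy middleware. In this sketch the `EffortGatedMiddleware` class, the `ran_for` list, and the `generate` stand-in for the rest of the chain are hypothetical. Only the effort lookup and the threshold of 2 come from the snippet above.

```python
import asyncio
from types import SimpleNamespace


class EffortGatedMiddleware:
    """Toy middleware that records which requests reach its expensive path."""

    def __init__(self, skip_at_or_below: int = 2):
        self.skip_at_or_below = skip_at_or_below
        self.ran_for = []  # effort levels that were NOT short-circuited

    async def __call__(self, state, next):
        effort = getattr(state, "effort_level", None) or (
            state.metadata.get("effort_level") if hasattr(state, "metadata") else None
        )
        if effort is not None and effort <= self.skip_at_or_below:
            return await next(state)  # cheap question: skip recall and storage
        self.ran_for.append(effort)   # the expensive recall/store work would go here
        return await next(state)


async def generate(state):
    """Stand-in for the rest of the chain."""
    return f"answer to {state.message!r}"


async def demo():
    mw = EffortGatedMiddleware()
    trivial = SimpleNamespace(message="hi", effort_level=1, metadata={})
    deep = SimpleNamespace(
        message="plan a migration", effort_level=None, metadata={"effort_level": 6}
    )
    await mw(trivial, generate)
    await mw(deep, generate)
    return mw.ran_for


ran_for = asyncio.run(demo())
print(ran_for)
```

A request with no effort level at all falls through to the full path, which matches the `effort is not None` guard in the original check.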
The MCTS Connection
This fix exposed an interesting interaction with the MCTS planning layer.
PillarWiring's wire_intent_to_plan() takes an intent classification and effort level and produces an execution plan. That plan includes reasoning_depth (skip/gate/light/sase/sase_verify) and context_depth (axioms/fast/full/full_graph/all_layers). These depth levels control which middlewares run and how much context they assemble.
The problem was that ConversationMemoryMiddleware wasn't reading the execution plan at all. It ran the full recall+store cycle regardless of what the planner decided.
Now it respects the effort boundary. And the timeout changes mean that even when it does run (effort >= 3), it can't block the pipeline for more than 5 seconds on recall. The MCTS planner can safely include conversation memory in its full and full_graph context depths knowing it won't blow the time budget.
The scale_mcts_config() function in PillarWiring dynamically adjusts MCTS iteration count and time limits based on effort:
| Effort | MCTS Iterations | Time Limit | Exploration Weight |
|---|---|---|---|
| 1 | 20 | 0.2s | 1.8 (high) |
| 5 | 100 | 0.8s | 1.2 (balanced) |
| 10 | 300 | 5.0s | 0.6 (exploitative) |
Low effort = few iterations, high exploration (find a good-enough answer fast). High effort = many iterations, low exploration (deeply evaluate the best paths). The middleware gating aligns with this: there's no point running 300 MCTS iterations to decide how to plan "hi".
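The table gives only three anchor points, so how scale_mcts_config() behaves between them is not stated here. As a purely illustrative sketch, the function below assumes linear interpolation between the anchors. The name `scale_mcts_config_sketch` and the interpolation rule are assumptions, not the PillarWiring implementation.

```python
# Anchor points copied from the table:
# effort -> (iterations, time limit in seconds, exploration weight)
ANCHORS = [(1, 20, 0.2, 1.8), (5, 100, 0.8, 1.2), (10, 300, 5.0, 0.6)]


def scale_mcts_config_sketch(effort: int) -> dict:
    """Interpolate linearly between the table's anchor points (an assumption)."""
    effort = max(1, min(10, effort))  # clamp to the documented 1-10 range
    for (e0, it0, t0, w0), (e1, it1, t1, w1) in zip(ANCHORS, ANCHORS[1:]):
        if e0 <= effort <= e1:
            f = (effort - e0) / (e1 - e0)
            return {
                "iterations": round(it0 + f * (it1 - it0)),
                "time_limit_s": t0 + f * (t1 - t0),
                "exploration_weight": w0 + f * (w1 - w0),
            }
    raise ValueError(f"effort out of range: {effort}")


for level in (1, 3, 5, 10):
    print(level, scale_mcts_config_sketch(level))
```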
The Numbers
Before:
- Single question: 103 seconds
- Health check during chat: connection timeout
- Concurrent requests: 0 (fully blocked)
- 50-question benchmark: 85+ minutes (estimated)
- Throughput: 0.6 requests/minute
After:
- Single question: 19 seconds (5.4x faster)
- Health check during chat: 0.2 seconds
- Concurrent requests: 8 gunicorn workers, each running an async event loop that can hold hundreds of pending I/O operations
- Workers: 8 (auto-scaled, configurable via GENESIS_WORKERS)
- Throughput: ~25 requests/minute for simple questions, 3-4 concurrent deep chats
Where the time went:
| Component | Before | After | Savings |
|---|---|---|---|
| ConversationMemory recall | 5-30s | 0-5s (timeout) | up to 25s |
| LLM generation | 5.9s | 5.9s | 0s |
| ConversationMemory store | 10-70s | 0s (fire-and-forget) | up to 70s |
| Context pipeline | 20s | 5-10s (effort gating) | 10-15s |
| Total (measured) | 103s | ~19s | 84s |
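The headline figures follow from the two measured latencies. The arithmetic below reproduces them. The "after" throughput assumes all 8 workers stay busy with 19-second questions, which is an idealization rather than a measurement.

```python
before_s, after_s = 103, 19   # seconds per question, before and after
workers, questions = 8, 50

speedup = before_s / after_s                          # single-question speedup
benchmark_minutes_before = questions * before_s / 60  # serial 50-question run
throughput_before = 60 / before_s                     # requests/minute, one blocked worker
throughput_after = workers * 60 / after_s             # requests/minute, 8 busy workers

print(f"speedup: {speedup:.1f}x")
print(f"benchmark before: {benchmark_minutes_before:.1f} min")
print(f"throughput: {throughput_before:.1f} -> {throughput_after:.0f} req/min")
```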
What We Didn't Do
This fix is deliberately minimal. We didn't:
- Remove middlewares — all 30 stay. The pipeline structure is correct; the problem was specific middlewares with no timeouts.
- Rewrite UnifiedKnowledgeLayer — the 30-second HTTP timeouts inside it are still there. Our changes wrap them at the caller level with tighter timeouts.
- Add worker offloading — Genesis already has an AitherWorker container (port 8159) that runs background loops (SchedulerLoop, NeuronDaemon, JarvisBrain, CodeGraph). Chat processing stays in Genesis because it needs the full middleware chain. Offloading would require duplicating the pipeline.
- Implement request queuing: gunicorn deployments get this for free. Excess connections wait in the listening socket's backlog until a worker is free to accept them.
The Lesson
The architecture was right. The implementation was right. The missing piece was defensive timeouts on optional operations.
In a microservice system, every HTTP call to another service is a potential 30-second block. When you chain four of those sequentially in a post-processing middleware, you get a 120-second tax on every request — even though the actual work (LLM generation) took 6 seconds.
The fix is almost embarrassingly simple: asyncio.wait_for() with sensible timeouts, asyncio.create_task() for fire-and-forget side effects, and effort-based gating to skip expensive operations on cheap questions.
The harder lesson is about uvicorn workers. The "just increase workers=N" advice you'll find in every FastAPI tutorial doesn't work when your application has complex module-level initialization. Genesis is 6,400 lines with 50+ router imports, singleton initialization, signal handlers, and a lifespan context manager that starts an orchestrator. uvicorn's built-in multiprocessing spawns fresh processes that re-import all of that — and three out of four silently crashed.
Gunicorn exists because this problem is old. Its process management is battle-tested: proper fork lifecycle, worker monitoring with heartbeat-based kill/restart, graceful shutdown, signal forwarding, and a post_fork hook for per-worker initialization. The switch from uvicorn.run(app, workers=4) to gunicorn -k uvicorn.workers.UvicornWorker -w 8 was the difference between "workers keep dying" and "8 workers, zero errors."
Try It
If you're running AitherOS, the fix is already deployed:
# Pull latest
git pull
# Rebuild Genesis (picks up gunicorn + timeout fixes)
.DEPLOYMENT/scripts/compose.sh aitheros up -d --build aither-genesis
# Verify workers
docker logs aitheros-genesis 2>&1 | grep "Booting worker"
# Override worker count
# In .DEPLOYMENT/compose/docker-compose.aitheros.yml, add to Genesis environment:
# GENESIS_WORKERS: "16"
Run the benchmark yourself:
cd AitherOS
python dev/benchmarks/trick_question_benchmark.py --target aitheros
Health checks should respond instantly, even under full benchmark load. That's the whole point.