The Genesis Gravity Well: When Your Orchestrator Becomes a Bottleneck
We have a good architecture. AitherPulse is the heartbeat. Flux is the event bus. Genesis is the orchestrator — it subscribes to events and coordinates the system. The data flow is clean: services emit events to Pulse, Pulse forwards through Flux, subscribers react asynchronously. Genesis subscribes, plans, dispatches. It should never be in the critical path of anything except its own tick loop.
Except it was. In the critical path of everything.
The symptom
Genesis went unhealthy. Not a crash — worse. The container stayed up, the process kept running, but the HTTP server stopped responding. `curl http://localhost:8001/health` returned HTTP 000. Connection refused on a process that was demonstrably alive.
The logs told the story:
```
EVENT LOOP STALL #33: blocked for 29.5s beyond expected 5.0s (total=34.5s)
Running tasks: ['GenesisOrchestrator._registry_sync_loop',
'ContextWindowOptimizer._optimization_loop', 'LLMQueue._vram_rebalance_loop',
'NeuronDaemon._preemptive_firing_loop', '_event_loop_watchdog',
'RoutinesManager._process_routine_queue', ...15 more...]

[KERNEL] ⏰ Dispatch TIMEOUT genesis:context_engine_benchmark — task blocked the tick loop (effort=6). This is why agents stop posting!
[KERNEL] TICK TIMEOUT — exceeded 180s!
```
The watchdog had fired 33 times. The tick loop — a 5-second heartbeat that should complete in under a second — was taking over 3 minutes. The cascading failure was textbook: blocked tick → blocked dispatch → blocked LLM calls → blocked health checks → Docker marks container unhealthy → Constellation loses Genesis → "Limited mode" banner on the dashboard.
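The stall-detection mechanism itself is simple. Here is a minimal sketch of how an event-loop watchdog like this can work — illustrative only; the `watchdog` function and its parameters are assumptions, not the actual `_event_loop_watchdog`:

```python
import asyncio
import time

async def watchdog(interval: float, threshold: float,
                   stalls: list, stop: asyncio.Event) -> None:
    # Sleep for `interval`; if we wake up more than `threshold` late,
    # something blocked the event loop in between — record the stall.
    while not stop.is_set():
        start = time.monotonic()
        await asyncio.sleep(interval)
        lag = (time.monotonic() - start) - interval
        if lag > threshold:
            stalls.append(lag)

async def main() -> list:
    stalls: list = []
    stop = asyncio.Event()
    task = asyncio.create_task(watchdog(0.05, 0.2, stalls, stop))
    await asyncio.sleep(0.1)   # watchdog ticks normally
    time.sleep(0.5)            # a blocking call stalls the whole loop
    await asyncio.sleep(0.1)   # watchdog wakes up late and records the stall
    stop.set()
    await task
    return stalls

stalls = asyncio.run(main())
```

The key property: the watchdog costs almost nothing while the loop is healthy, but any synchronous call that blocks the loop shows up as measurable lag on the next wake-up.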
The real problem: architectural drift
The root cause wasn't a bug. It was a slow accumulation of shortcuts.
We built the pub/sub architecture correctly. Pulse emits, Flux routes, subscribers react. But over three months of rapid feature development — shipping 29 agent personas, an organic decision engine, a six-pillar kernel, a neuron daemon — new code kept taking the path of least resistance: calling Genesis directly over HTTP instead of publishing events to the bus.
Here's what I found when I audited the call graph:
Genesis was calling itself over HTTP
The AgentKernel runs inside Genesis. It was absorbed into the Genesis process in February. But the kernel still had methods that made HTTP calls back to Genesis's own endpoints:
```python
# _refresh_system_snapshot — HTTP GET to ourselves
async def _refresh_system_snapshot(self) -> None:
    url = self._get_genesis_url()
    resp = await self._http.get(f"{url}/snapshot", timeout=5.0)
```
Genesis's event loop makes an HTTP request. Genesis's HTTP server must handle it. But the event loop is busy making the request. Deadlock under load.
The _check_microscheduler_capacity method had a similar self-referencing fallback:
```python
# Fallback: HTTP to Genesis /llm/queue/status
url = self._get_genesis_url()
resp = await self._http.get(f"{url}/llm/queue/status", timeout=5.0)
```
Since MicroScheduler was absorbed into Genesis, this is literally Genesis asking Genesis about its own queue depth over HTTP.
Neuron events created a new HTTP client per fire
The NeuronDaemon fires 20+ neurons on timer-based schedules. Every single neuron fire created a brand-new httpx.AsyncClient, opened a TCP connection to Pulse, POSTed an event, and tore down the connection:
```python
# BEFORE — new TCP connection per neuron fire
async with httpx.AsyncClient(timeout=2.0) as client:
    await client.post(f"{pulse_url}/events/emit", json=event_data)
```
Multiply that by 20 neurons firing every 15–60 seconds. That's hundreds of TCP connection setup/teardown cycles per minute, all competing with the Genesis event loop for scheduling time. And the NeuronDaemon had zero Flux integration despite running inside a process that already had FluxEmitter initialized.
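The cost difference between per-fire clients and a shared, long-lived client can be made concrete with a toy model. This sketch uses a fake client that only counts connection setups — `FakeClient` is a stand-in, not the httpx API:

```python
import asyncio

class FakeClient:
    """Illustrative stand-in for an HTTP client that counts connection setups."""
    opens = 0

    def __init__(self):
        FakeClient.opens += 1  # one "TCP handshake" per client created

    async def post(self, url, json=None):
        await asyncio.sleep(0)  # pretend to do network I/O

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        pass

async def fire_per_call(n):
    # BEFORE — a fresh client (and connection) for every neuron fire
    for _ in range(n):
        async with FakeClient() as client:
            await client.post("/events/emit", json={})

async def fire_shared(n):
    # AFTER — one long-lived client reused across all fires
    client = FakeClient()
    for _ in range(n):
        await client.post("/events/emit", json={})

FakeClient.opens = 0
asyncio.run(fire_per_call(20))
per_call_opens = FakeClient.opens   # one setup per fire

FakeClient.opens = 0
asyncio.run(fire_shared(20))
shared_opens = FakeClient.opens     # one setup total
```

With a real `httpx.AsyncClient`, the shared variant additionally benefits from connection pooling and keep-alive, so even fires to the same host reuse sockets.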
Dispatch blocked the tick loop
The most devastating pattern: the kernel tick loop awaited every dispatch to completion before moving to the next tick. A dispatch might route through an LLM call that takes 60–120 seconds. The tick loop used asyncio.gather(*dispatch_tasks) — meaning it waited for all dispatches before proceeding:
```python
# BEFORE — tick loop blocked until all LLM calls complete
if dispatch_tasks:
    await asyncio.gather(*dispatch_tasks, return_exceptions=True)
```
With 3 agents dispatched at effort levels 6–10, that's three LLM calls at 30–120 seconds each, all blocking the tick. During those minutes, no health checks, no heartbeats, no event processing. The watchdog screams. Docker gives up.
The architectural rule we violated
The rule is simple: Genesis subscribes. It never serves.
Genesis should receive events from Flux and make decisions based on them. It should never be in the HTTP request path of other services or — even worse — in its own request path. Every time we wrote `await self._http.get(genesis_url/...)` inside Genesis, we created a self-referencing loop that turned our orchestrator into a bottleneck.
The Flux event bus exists precisely to decouple this. Services emit events. Subscribers react asynchronously. No service should need to know or care whether Genesis is healthy to do its job. And Genesis should never need to call itself to check its own state.
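The decoupling property is worth seeing in miniature. This toy bus is not the Flux API — names and structure are illustrative — but it shows the invariant that matters: `emit()` never awaits a subscriber, so a slow consumer can never block the emitter:

```python
import asyncio
from collections import defaultdict

class ToyBus:
    """Toy in-process event bus: emit() never awaits subscribers."""
    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subs[topic].append(handler)

    def emit(self, topic, payload):
        # Fire-and-forget: each handler runs as its own background task,
        # so the emitter returns immediately regardless of handler speed.
        for handler in self._subs[topic]:
            asyncio.get_running_loop().create_task(handler(payload))

async def main():
    bus = ToyBus()
    seen = []

    async def on_heartbeat(payload):
        seen.append(payload)

    bus.subscribe("heartbeat", on_heartbeat)
    bus.emit("heartbeat", {"component": "NeuronDaemon"})
    await asyncio.sleep(0.01)  # give the handler task a chance to run
    return seen

seen = asyncio.run(main())
```

The inverse design — emitters awaiting subscriber responses — is exactly the coupling that pulled everything into Genesis's critical path.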
The fix
Four changes, applied in sequence:
1. In-process state access instead of HTTP self-calls
The snapshot and queue depth methods now access the orchestrator's state directly:
```python
# AFTER — in-process, zero network overhead
async def _refresh_system_snapshot(self) -> None:
    try:
        from AitherGenesis.genesis_service import orchestrator as _orch
        snapshot = {
            "status": "ok",
            "services": {},
            "health": getattr(_orch.state, "health", "unknown"),
        }
        for name, svc in getattr(_orch.state, "services", {}).items():
            snapshot["services"][name] = {
                "status": getattr(svc, "status", "unknown"),
                "health_score": getattr(svc, "health_score", None),
            }
        self.state.system_snapshot = snapshot
        self.state.snapshot_fetched_at = now  # `now` is set earlier in the full method
    except Exception:
        logger.debug("[KERNEL] Snapshot: orchestrator not importable, skipping")
```
No HTTP. No event loop contention. The data is in the same process — just read it.
The MicroScheduler capacity check similarly dropped its HTTP fallback entirely. If the in-process AitherLLMQueue isn't available, mark it offline instead of self-calling:
```python
# AFTER — in-process only, never HTTP self-call
try:
    from lib.cognitive.AitherLLMQueue import get_queue, get_vllm_pool
    queue = get_queue()
    self.state.microscheduler_queue_depth = queue.get_stats().queued
    self.state.microscheduler_online = True
except Exception:
    self.state.microscheduler_online = False  # Don't HTTP self-call
```
2. Neuron events through FluxEmitter
The NeuronDaemon now uses the FluxEmitter singleton instead of creating per-event HTTP clients:
```python
# AFTER — FluxEmitter pub/sub, no TCP overhead
from lib.core.FluxEmitter import FluxEmitter, EventType

_neuron_flux_emitter = FluxEmitter("NeuronDaemon", auto_register=False)

async def emit_neuron_event(event_type, neuron_type, query, ...):
    if _FLUX_EMIT_AVAILABLE and _neuron_flux_emitter:
        _neuron_flux_emitter.emit(
            EventType.HEARTBEAT,
            component="NeuronDaemon",
            event=f"neuron.{event_type}",
            **event_data,
        )
        return
    # HTTP fallback only if Flux genuinely unavailable
```
Events flow through Flux → Pulse → dashboards automatically. No new TCP connections. No event loop contention.
3. Fire-and-forget dispatch
The dispatch pattern changed from blocking gather to background tasks with a short wait:
```python
# AFTER — fire-and-forget with a 15s grace period
bg_tasks = [asyncio.create_task(t) for t in dispatch_tasks]
_done, _pending = await asyncio.wait(
    bg_tasks, timeout=15.0, return_when=asyncio.ALL_COMPLETED
)
if _pending:
    logger.info(
        f"[KERNEL] {len(_pending)} dispatch(es) continuing in background"
    )
```
Fast tasks (heartbeats, status checks) complete within 15 seconds and get tracked normally. Slow tasks (LLM generation, code review) continue in the background. The tick loop moves on. The event loop stays responsive.
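The behavior is easy to verify in isolation. This self-contained sketch (task names and durations are invented for illustration) shows fast dispatches completing inside the grace period while a slow one stays pending, and the "tick" returning after the timeout rather than after the slowest task:

```python
import asyncio
import time

async def dispatch(name, duration):
    await asyncio.sleep(duration)  # stand-in for a heartbeat or LLM call
    return name

async def tick():
    tasks = [
        asyncio.create_task(dispatch("heartbeat", 0.01)),   # fast
        asyncio.create_task(dispatch("status", 0.02)),      # fast
        asyncio.create_task(dispatch("llm-review", 5.0)),   # slow
    ]
    start = time.monotonic()
    done, pending = await asyncio.wait(tasks, timeout=0.1)
    elapsed = time.monotonic() - start
    for t in pending:
        t.cancel()  # in Genesis these keep running in the background instead
    return len(done), len(pending), elapsed

done_count, pending_count, elapsed = asyncio.run(tick())
```

Note that `asyncio.wait` never cancels what it times out on — the pending tasks keep running on the loop, which is exactly the fire-and-forget semantics the kernel needs (the demo cancels them only to exit cleanly).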
4. Tick timeout from 180s to 45s
With dispatches no longer blocking, the tick should never take 180 seconds. We dropped the timeout to 45 seconds — enough for the organic decision engine's LLM calls but tight enough to catch real stalls early.
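A hard tick timeout is a one-liner with `asyncio.wait_for`. This is a sketch of the shape, not the kernel's actual code — `run_tick` and its parameters are assumptions:

```python
import asyncio

async def run_tick(tick_coro, timeout):
    """Run one tick under a hard timeout; report whether it finished."""
    try:
        await asyncio.wait_for(tick_coro, timeout=timeout)
        return True
    except asyncio.TimeoutError:
        # The real kernel would log "[KERNEL] TICK TIMEOUT" here and move on
        return False

async def main():
    fast = await run_tick(asyncio.sleep(0.01), timeout=1.0)   # completes
    slow = await run_tick(asyncio.sleep(10.0), timeout=0.05)  # cut off
    return fast, slow

fast_ok, slow_ok = asyncio.run(main())
```

Unlike `asyncio.wait`, `wait_for` cancels the coroutine on timeout — which is what you want for a stuck tick: abandon it and start the next one cleanly.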
The organic decision engine (it's not MCTS)
A question that came up during the fix: the organic planner takes 10–15 seconds per agent decision. Is it doing Monte Carlo Tree Search?
No. The OrganicDecider is simpler and arguably more interesting. It's a direct LLM call with persona injection. For each agent in the tick, it:
- Builds an `OrganicContext` with the agent's personality, mood, available actions, and system state
- Renders a prompt that says "Given who you are and what you see, what would you do?"
- Sends it to the orchestrator model (Qwen 32B via vLLM) at temperature 0.85
- Parses the JSON response for an `action_id` and `reason`
It's not tree search — it's structured generation with persona grounding. The 10–15 second latency comes from the LLM inference itself, not from search or simulation. The organic decisions replace what used to be `random.choice()` calls that picked agent tasks by probability. Now each agent genuinely reasons about what to do based on its personality and the system state.
Deterministic overrides still trump the LLM: coordinated tasks from routines.yaml always win, rate limits are hard constraints, pain gating prevents dispatch during stress, and budget caps are enforced after the decision. The LLM provides the creative direction; the kernel enforces the guardrails.
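The override-then-LLM layering can be sketched in a few lines. Everything here is illustrative — the function name, context keys, and fake LLM are assumptions, not the OrganicDecider's real interface:

```python
import json

def decide(persona, context, llm):
    # Deterministic overrides always win over the LLM
    if context.get("routine_task"):
        return {"action_id": context["routine_task"], "reason": "routines.yaml"}
    if context.get("rate_limited"):
        return {"action_id": "idle", "reason": "rate limit"}
    # Otherwise: one direct, persona-grounded LLM call
    # (temperature 0.85 against the orchestrator model in the real system)
    prompt = (
        f"You are {persona}. Available actions: {context['actions']}. "
        "Given who you are and what you see, what would you do? Reply as JSON."
    )
    choice = json.loads(llm(prompt))
    return {"action_id": choice["action_id"], "reason": choice["reason"]}

# A canned "LLM" standing in for the real model
fake_llm = lambda prompt: '{"action_id": "review_pr", "reason": "curiosity"}'

organic = decide("Ada", {"actions": ["review_pr", "idle"]}, fake_llm)
override = decide("Ada", {"actions": ["idle"], "routine_task": "daily_report"}, fake_llm)
```

The ordering is the point: hard constraints are checked before the model is ever invoked, so the LLM can only ever choose within the space the kernel allows.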
The metrics
Before the fix:
| Metric | Before |
|---|---|
| Event loop stalls | 33 in 6 hours |
| Tick duration | 180s+ (hard timeout) |
| Iteration cycle | 180s+ (blocked on dispatch) |
| Health endpoint | HTTP 000 (unreachable) |
| Docker health | unhealthy |
| Constellation | "Limited mode" |
| Dispatch pattern | await gather() — blocking |
After the fix:
| Metric | After |
|---|---|
| Event loop stalls | 0 |
| Tick duration | ~35s (organic planning) |
| Iteration cycle | 35-40s (non-blocking) |
| Health endpoint | HTTP 200 in <100ms |
| Docker health | healthy |
| Constellation | Full agent discovery |
| Dispatch pattern | create_task() + wait(timeout=15) |
The tick still takes ~35 seconds because of the organic decision LLM calls (10–15s per agent × 2–3 agents). That's irreducible latency from inference. But the critical change is that this no longer blocks the event loop — health checks respond, dispatches run in background, heartbeats flow, and the watchdog stays quiet.
The lesson
Build your pub/sub architecture. Document it. Then enforce it in code review.
We had the right design from day one: Pulse emits, Flux routes, Genesis subscribes. But rapid feature development creates gravitational pull toward the orchestrator. When you're inside Genesis adding a new feature, the tempting thing is to call another Genesis endpoint directly — the code is right there, the URL is localhost, the response is immediate. What you don't see is that you're adding another blocking call to an event loop that's already managing 15+ concurrent background tasks, 20+ neuron timers, an SSE stream, and a scheduler loop.
The fix for architectural drift isn't better documentation (though we updated ours). It's structural:
- In-process state is free. If two components live in the same process, they should share memory, not HTTP.
- Events are decoupled. If something can be an event, make it an event. Don't await the response.
- Dispatches are fire-and-forget. If a task takes longer than your tick interval, it belongs in the background.
- Self-referencing HTTP is always a bug. A process should never make HTTP calls to its own endpoints.
We're also adding a linter rule: if any code inside `AitherGenesis/` contains `self._http.get(f"{self._get_genesis_url()` it fails CI. The gravity well has guardrails now.