From 15 Seconds to 0.1 Milliseconds: Killing the LLM in the Loop
We have an organic decision engine. It's beautiful in theory: ask the LLM what each agent persona would want to do given their personality, mood, current affect state, and available tasks. A security-focused agent notices a vulnerability scan is overdue. A creative agent gravitates toward content generation when they're in a good mood. A maintenance agent does system checks at 3am when nothing else is happening.
The problem is we were doing this with a 15-second LLM inference call. Inside the kernel tick loop. Which runs every 45 seconds.
The numbers that made us stop
I pulled the Genesis logs after our previous round of fixes (the Gravity Well event loop stall incident). The tick timeout was down to 45 seconds. Good — we'd fixed the HTTP self-calls and the blocking dispatch. But something was still eating 10-15 seconds per tick cycle.
```
[KERNEL-6P] P3 organic timeout for genesis, using deterministic
[KERNEL-6P] P3 organic timeout for demiurge, using deterministic
model=mistral-nemo:latest
LLM Gateway failed: Request timed out after 709.6s
```
Three things jumped out:

1. **The model was wrong.** OrganicDecider resolves `get_default_model()`, which returns `aither-orchestrator`. But the model actually loaded was `mistral-nemo:latest`, a 7B model that wasn't even the one we configured. The LLM gateway was silently falling back.
2. **Most decisions timed out.** The 15-second timeout in SixPillarsKernel's `_organic_select_task` was tripping constantly. The LLM queue was backed up with real user requests, and kernel task selection was getting starved.
3. **The fallback was fine.** Every timeout fell through to `_select_task()`, a deterministic function that picks the most overdue task with weighted random selection. It runs in microseconds, and the system worked fine with it.
We were burning 15 seconds of GPU time — blocking the kernel tick loop — to get a result that was marginally different from what a weighted scoring function produces in 0.1 milliseconds. And 90% of the time, the LLM call just failed anyway.
What the LLM was actually doing
Here's the OrganicDecider's decide() flow:
1. Build OrganicContext (fetch affect from Sense via HTTP — 2 new connections)
2. Render a markdown prompt with persona, mood, available actions
3. POST to /v1/chat/completions (the LLM gateway queue)
4. Parse JSON from the LLM response
5. Map action_id back to an AgentTask
6. Return OrganicDecision
Steps 1-2 create HTTP connections on every call. Step 3 waits in the LLM queue behind real user requests. Step 4 parses unstructured text. The prompt looked like:
```
You are {persona_name}, a {persona_description}.
Your current mood: valence={valence}, arousal={arousal}
Available actions: [list of 8-15 tasks with descriptions]
Pick one and return JSON: {"action_id": "...", "reason": "..."}
```
This is asking an LLM to do what a scoring function does. Pick one item from a list based on weighted criteria. There's no creative generation. No multi-step reasoning. No nuance that requires language understanding. It's a selection problem — and we were solving it with a generation model.
The replacement: decide_fast()
The new decide_fast() method scores every available action using seven signals and picks via weighted random selection:
```python
import math

def decide_fast(self, ctx: OrganicContext) -> OrganicDecision:
    scored = []
    for action in ctx.ready_actions:  # tasks whose cooldown has elapsed
        score = 1.0

        # 1. Overdue factor (dominant signal, log-scaled)
        overdue_ratio = action.last_executed_ago_min / action.interval_minutes
        score += math.log1p(overdue_ratio) * 2.0

        # 2. Persona domain affinity
        if action.domain in self.agent_preferred_domains:
            score += 1.5

        # 3. Affect modulation (arousal/valence × domain boosts)
        score += AROUSAL_BOOST.get(action.domain, 0.0) * ctx.arousal
        score += VALENCE_BOOST.get(action.domain, 0.0) * ctx.valence

        # 4. Time-of-day preference
        score += TIME_PREFS[ctx.time_of_day].get(action.domain, 0.0)

        # 5. Energy penalty (low energy → avoid high-effort tasks)
        if ctx.energy < 30:
            score -= (action.effort - 3) * 0.2

        # 6. Pain boost (high pain → favor maintenance work)
        if ctx.system_pain > 0.5 and action.domain in (MAINTENANCE, SYSTEM, SECURITY):
            score += ctx.system_pain * 1.5

        # 7. Floor at 0.1 (keep all options alive)
        scored.append((max(score, 0.1), action))

    # Weighted random selection over the scored actions
    return weighted_random_pick(scored)
```
The mapping tables (`AROUSAL_BOOST`, `VALENCE_BOOST`, `TIME_PREFS`) encode the same intuitions the LLM prompt was trying to elicit, but deterministically, with zero inference cost.
Key insight: the LLM was implementing a scoring function using natural language. We just wrote the scoring function directly.
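The `weighted_random_pick` helper at the end of `decide_fast` isn't shown above; a minimal version (a sketch, assuming it takes the `(score, action)` tuples as built in the loop) is a one-liner over `random.choices`:

```python
import random
from typing import Sequence, Tuple, TypeVar

T = TypeVar("T")

def weighted_random_pick(scored: Sequence[Tuple[float, T]]) -> T:
    """Pick one action with probability proportional to its score."""
    weights = [score for score, _ in scored]
    actions = [action for _, action in scored]
    return random.choices(actions, weights=weights, k=1)[0]
```

The 0.1 score floor in the loop matters here: every action keeps a nonzero weight, so no task can be starved forever by the weighting.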
Results
| Metric | LLM path | decide_fast() |
|---|---|---|
| Latency | 10,000-15,000ms | 0.1ms |
| Success rate | ~10% (timeouts) | 100% |
| GPU cost per tick | ~15s inference | 0 |
| HTTP connections | 3+ new per call | 0 |
| Persona awareness | Yes (when it works) | Yes (domain affinity tables) |
| Affect modulation | Yes (when it works) | Yes (arousal/valence weights) |
The live logs confirmed immediately:
```
⚡ [genesis] Fast organic: codegraph_health_check (overdue 999m, persona match) [0.1ms]
⚡ [demiurge] Fast organic: social_session (overdue 999m, persona match) [0.1ms]
⚡ [saga] Fast organic: auto_scene_visualization (overdue 999m, persona match) [0.1ms]
```
100,000x speedup. Every organic agent now gets scored on every tick, not just the first two (we had a `_MAX_ORGANIC_PER_TICK = 2` rate limit to contain the LLM cost; it's now unnecessary).
What we lost
decide_fast() is better in every measurable way. But it's myopic. It scores each action independently and picks one. It can't reason about sequences:
- "If I do maintenance now, I'll have energy for creative work next tick"
- "Security is overdue but not critical — I can handle two code tasks first"
- "Pain is rising. Better do system checks now before it cascades into a circuit breaker trip"
The LLM theoretically could have done this. In practice, it never did — the prompts weren't structured for sequential reasoning, and the model was too small for reliable multi-step planning anyway.
But the question remains: can we get sequence-aware planning without going back to LLM inference?
The next step: Monte Carlo Tree Search
MCTS is the answer. It's how AlphaGo plans moves — not by evaluating each move independently, but by simulating sequences of moves and propagating results back through a search tree. The key insight is that decide_fast() is an excellent rollout policy for MCTS.
Here's the architecture:
State
```python
from dataclasses import dataclass

@dataclass
class SimState:
    energy: float        # Agent energy remaining
    pain: float          # System pain level
    cooldowns: dict      # action_id → seconds until available
    last_executed: dict  # action_id → virtual timestamp
    tick: int            # Steps from root
```
Simulation
Each step in the simulation:
- Pick an action from available (non-cooldown) tasks
- Deduct energy proportional to effort
- Set the action's cooldown
- Adjust pain: maintenance/system tasks reduce it, others let it drift up
- Advance time: all cooldowns tick down
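Those five steps fit in one transition function. This is a sketch, not the Genesis code: the energy-per-effort scale, the pain deltas, and the 60-second virtual step are assumptions, and the action fields mirror the ones `decide_fast` uses:

```python
from dataclasses import dataclass

@dataclass
class SimState:  # mirrors the dataclass above
    energy: float
    pain: float
    cooldowns: dict      # action_id → seconds until available
    last_executed: dict  # action_id → virtual tick
    tick: int

MAINTENANCE_LIKE = {"maintenance", "system"}  # assumption: domain labels

def simulate_step(state: SimState, action, tick_seconds: float = 60.0) -> SimState:
    """Apply one virtual action and return the successor state."""
    # Advance time: every existing cooldown ticks down by one virtual step
    cooldowns = {a: max(t - tick_seconds, 0.0) for a, t in state.cooldowns.items()}
    # Set the executed action's cooldown back to its full interval
    cooldowns[action.action_id] = action.interval_minutes * 60.0
    # Maintenance/system work reduces pain; everything else lets it drift up
    if action.domain in MAINTENANCE_LIKE:
        pain = max(state.pain - 0.1, 0.0)
    else:
        pain = min(state.pain + 0.02, 1.0)
    last_executed = {**state.last_executed, action.action_id: state.tick}
    return SimState(
        energy=state.energy - action.effort * 2.0,  # energy cost ∝ effort (assumed scale)
        pain=pain,
        cooldowns=cooldowns,
        last_executed=last_executed,
        tick=state.tick + 1,
    )
```

Because the state is a handful of floats and two small dicts, each transition is cheap enough to run thousands of times per real tick.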
Search
```
Root = current agent state
Repeat 64 times:
  1. SELECT:  walk down the tree using UCB1
              (balance exploitation of best-known vs exploration of untried)
  2. EXPAND:  add a child node for an untried action
  3. ROLLOUT: simulate H steps using decide_fast scoring as the policy
  4. BACKUP:  propagate the terminal reward up the tree
Return: the root action with the most visits
```
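The UCB1 rule in the SELECT step is the standard one: mean reward plus an exploration bonus that shrinks as a child accumulates visits (`c` is the usual exploration constant, around √2):

```python
import math

def ucb1(value_sum: float, visits: int, parent_visits: int, c: float = 1.4) -> float:
    """UCB1 selection score: exploitation term plus exploration bonus."""
    if visits == 0:
        return math.inf  # untried children win selection immediately
    exploitation = value_sum / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration
```

At each tree node, SELECT picks the child maximizing this score; EXPAND fires whenever some child still has `visits == 0`.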
Reward function
At the end of a rollout (H=5 steps):
```python
reward = (
    coverage_score         # Were all task types attended to?
    + urgency_score        # Were critically overdue tasks handled?
    + pain_trajectory      # Did pain decrease over the sequence?
    + energy_efficiency    # Did we avoid depleting energy?
    + persona_alignment    # Did we stay in our work_type lane?
)
```
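One way those five terms could be computed from a rollout's states and actions. This is illustrative only: the field names (`overdue_ratio`), the 3× overdue threshold, and the equal weighting are assumptions, not the Genesis values:

```python
def rollout_reward(states, actions, all_domains, persona_domains) -> float:
    """Terminal reward for an H-step rollout; each term lands roughly in [0, 1]."""
    n = max(len(actions), 1)
    # coverage: fraction of task domains touched during the rollout
    coverage = len({a.domain for a in actions}) / max(len(all_domains), 1)
    # urgency: credit for executing actions that were critically overdue
    urgency = sum(1.0 for a in actions if a.overdue_ratio > 3.0) / n
    # pain trajectory: positive if pain fell over the sequence
    pain_trajectory = states[0].pain - states[-1].pain
    # energy efficiency: reward ending the rollout with reserves
    energy_efficiency = states[-1].energy / 100.0
    # persona alignment: fraction of actions in the agent's preferred domains
    persona_alignment = sum(1.0 for a in actions if a.domain in persona_domains) / n
    return coverage + urgency + pain_trajectory + energy_efficiency + persona_alignment
```

Keeping every term on a comparable scale matters: if one component dominates, the search collapses into a single-objective optimizer and loses the trade-off behavior the reward is meant to capture.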
The key trick: decide_fast as rollout policy
MCTS needs a rollout policy — a cheap function to play out hypothetical futures. decide_fast() is exactly that. It already encodes persona affinity, affect modulation, and urgency weighting. Using it as the rollout policy means every simulated future is persona-consistent, not random.
This is the same pattern that made AlphaGo work: a strong policy network makes rollouts realistic, so the tree search converges quickly.
Performance budget
- 64 iterations × 5-step horizon = 320 state transitions
- Each transition: a few dict operations ≈ 100ns
- Scoring per step: ~1μs (the fast scorer)
- Total: ~5-10ms
That's 100x more than decide_fast() alone, but still 1,500x faster than the LLM path. And unlike the LLM, it actually reasons about sequences.
When to use MCTS vs fast
Not every tick needs sequence planning. The decision of when to think harder is itself a decision:
| Condition | Strategy |
|---|---|
| Pain > 0.6 | MCTS — system stressed, need strategic recovery |
| Energy < 30% | MCTS — budget remaining actions carefully |
| 3+ tasks with overdue > 3× interval | MCTS — competing priorities need sequencing |
| Normal operation | Fast — any reasonable choice is fine |
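The dispatch itself is just a few comparisons. A sketch using the table's thresholds, assuming the context exposes the same fields `decide_fast` reads:

```python
def choose_strategy(ctx) -> str:
    """Effort-scaled dispatch: escalate to MCTS only when the situation warrants it."""
    heavily_overdue = sum(
        1 for a in ctx.ready_actions
        if a.last_executed_ago_min > 3 * a.interval_minutes
    )
    if ctx.system_pain > 0.6:   # system stressed → strategic recovery
        return "mcts"
    if ctx.energy < 30:         # budget remaining actions carefully
        return "mcts"
    if heavily_overdue >= 3:    # competing priorities need sequencing
        return "mcts"
    return "fast"               # any reasonable choice is fine
```

Since this check runs in nanoseconds, the meta-decision adds no measurable overhead to the common case.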
This is effort-scaled decision making. Trivial situations get trivial computation. Complex situations get deeper search. The system thinks harder when it matters — without ever blocking the event loop.
The progression
```
LLM inference:  15,000ms  → mostly fails, blocks the event loop
Fast scoring:   0.1ms     → persona-aware, instant, no GPU
MCTS + fast:    ~5-10ms   → sequence-aware, still no GPU
```
Three layers, each building on the last:

- The scorer (`decide_fast`) encodes domain knowledge as weighted signals
- The tree search (`decide_mcts`) uses the scorer to simulate futures
- The meta-decision chooses which strategy to use based on situation complexity
No LLM in the loop. No HTTP calls. No GPU. Just data structures and arithmetic, encoding the same intuitions that the LLM prompt was trying to approximate — but reliably, deterministically, and 100,000 times faster.
The lesson
The reflex to "use an LLM for it" is strong. LLMs are genuinely magical for tasks that require language understanding, creative generation, and nuanced reasoning. But task selection from a finite list is not one of those tasks. It's a scoring problem. A search problem. A classical AI problem.
When you find yourself writing a prompt that says "pick one from this list based on these criteria" — stop. Write the scoring function. It'll be faster, more reliable, and more debuggable. Save the LLM for problems that actually need it.
Then, if you need sequence reasoning, reach for MCTS. It's been solving planning problems since 2006. It'll solve yours too — in 5 milliseconds instead of 15 seconds.
The OrganicDecider changes are live in Genesis. decide_fast() handles all kernel tick decisions. The MCTS layer (decide_mcts()) is implemented and available for high-stakes decisions. The LLM path (decide()) is preserved for social and innerlife decisions where personality-driven prose generation genuinely matters.