From 15 Seconds to 0.1 Milliseconds: Killing the LLM in the Loop
We have an organic decision engine. It's beautiful in theory: ask the LLM what each agent persona would want to do given their personality, mood, current affect state, and available tasks. A security-focused agent notices a vulnerability scan is overdue. A creative agent gravitates toward content generation when they're in a good mood. A maintenance agent does system checks at 3am when nothing else is happening.
The problem is we were doing this with a 15-second LLM inference call. Inside the kernel tick loop. Which runs every 45 seconds.
The numbers that made us stop
I pulled the Genesis logs after our previous round of fixes (the Gravity Well event loop stall incident). The tick timeout was down to 45 seconds. Good — we'd fixed the HTTP self-calls and the blocking dispatch. But something was still eating 10-15 seconds per tick cycle.
```
[KERNEL-6P] P3 organic timeout for genesis, using deterministic
[KERNEL-6P] P3 organic timeout for demiurge, using deterministic
model=mistral-nemo:latest
LLM Gateway failed: Request timed out after 709.6s
```
Three things jumped out:

1. **The model was wrong.** OrganicDecider resolves `get_default_model()`, which returns `aither-orchestrator`. But the model actually loaded was `mistral-nemo:latest`, a 7B model that wasn't even the one we configured. The LLM gateway was silently falling back.
2. **Most decisions timed out.** The 15-second timeout in SixPillarsKernel's `_organic_select_task` was tripping constantly. The LLM queue was backed up with real user requests, and kernel task selection was getting starved.
3. **The fallback was fine.** Every timeout fell through to `_select_task()`, a deterministic function that picks the most overdue task with weighted random selection. It runs in microseconds, and the system worked fine with it.
We were burning 15 seconds of GPU time — blocking the kernel tick loop — to get a result that was marginally different from what a weighted scoring function produces in 0.1 milliseconds. And 90% of the time, the LLM call just failed anyway.
What the LLM was actually doing
Here's the OrganicDecider's decide() flow:
1. Build OrganicContext (fetch affect from Sense via HTTP — 2 new connections)
2. Render a markdown prompt with persona, mood, available actions
3. POST to /v1/chat/completions (the LLM gateway queue)
4. Parse JSON from the LLM response
5. Map action_id back to an AgentTask
6. Return OrganicDecision
Steps 1-2 create HTTP connections on every call. Step 3 waits in the LLM queue behind real user requests. Step 4 parses unstructured text. The prompt looked like:
```
You are {persona_name}, a {persona_description}.
Your current mood: valence={valence}, arousal={arousal}
Available actions: [list of 8-15 tasks with descriptions]
Pick one and return JSON: {"action_id": "...", "reason": "..."}
```
This is asking an LLM to do what a scoring function does. Pick one item from a list based on weighted criteria. There's no creative generation. No multi-step reasoning. No nuance that requires language understanding. It's a selection problem — and we were solving it with a generation model.
The replacement: decide_fast()
The new decide_fast() method scores every available action using seven signals and picks via weighted random selection:
```python
import math

def decide_fast(self, ctx: OrganicContext) -> OrganicDecision:
    scored = []
    for action in ctx.ready_actions:  # tasks whose cooldown has elapsed
        score = 1.0

        # 1. Overdue factor (dominant signal, log-scaled)
        overdue_ratio = action.last_executed_ago_min / action.interval_minutes
        score += math.log1p(overdue_ratio) * 2.0

        # 2. Persona domain affinity
        if action.domain in self.agent_preferred_domains:
            score += 1.5

        # 3. Affect modulation (arousal/valence × domain boosts)
        score += AROUSAL_BOOST.get(action.domain, 0.0) * ctx.arousal
        score += VALENCE_BOOST.get(action.domain, 0.0) * ctx.valence

        # 4. Time-of-day preference
        score += TIME_PREFS[ctx.time_of_day].get(action.domain, 0.0)

        # 5. Energy penalty (low energy → avoid high-effort tasks)
        if ctx.energy < 30:
            score -= (action.effort - 3) * 0.2

        # 6. Pain boost (high pain → favor maintenance work)
        if ctx.system_pain > 0.5 and action.domain in (MAINTENANCE, SYSTEM, SECURITY):
            score += ctx.system_pain * 1.5

        # 7. Floor at 0.1 (keep all options alive)
        scored.append((max(score, 0.1), action))

    # Weighted random selection over the scored actions
    return weighted_random_pick(scored)
```
The mapping tables (`AROUSAL_BOOST`, `VALENCE_BOOST`, `TIME_PREFS`) encode the same intuitions the LLM prompt was trying to elicit, but deterministically, with zero inference cost.
Key insight: the LLM was implementing a scoring function using natural language. We just wrote the scoring function directly.
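The `weighted_random_pick` helper at the end of `decide_fast` isn't shown above; a minimal version (a sketch, assuming it takes the `(score, action)` tuples as built in the loop) is a one-liner over `random.choices`:

```python
import random
from typing import Sequence, Tuple, TypeVar

T = TypeVar("T")

def weighted_random_pick(scored: Sequence[Tuple[float, T]]) -> T:
    """Pick one action with probability proportional to its score."""
    weights = [score for score, _ in scored]
    actions = [action for _, action in scored]
    return random.choices(actions, weights=weights, k=1)[0]
```

The 0.1 score floor in the loop matters here: every action keeps a nonzero weight, so no task can be starved forever by the weighting.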
Results
| Metric | LLM path | decide_fast() |
|---|---|---|
| Latency | 10,000-15,000ms | 0.1ms |
| Success rate | ~10% (timeouts) | 100% |
| GPU cost per tick | ~15s inference | 0 |
| HTTP connections | 3+ new per call | 0 |
| Persona awareness | Yes (when it works) | Yes (domain affinity tables) |
| Affect modulation | Yes (when it works) | Yes (arousal/valence weights) |
The live logs confirmed immediately:
```
⚡ [genesis] Fast organic: codegraph_health_check (overdue 999m, persona match) [0.1ms]
⚡ [demiurge] Fast organic: social_session (overdue 999m, persona match) [0.1ms]
⚡ [saga] Fast organic: auto_scene_visualization (overdue 999m, persona match) [0.1ms]
```
100,000x speedup. Every organic agent now gets scored on every tick, not just the first two (we had a `_MAX_ORGANIC_PER_TICK = 2` rate limit to contain the LLM cost; it's now unnecessary).
What we lost
decide_fast() is better in every measurable way. But it's myopic. It scores each action independently and picks one. It can't reason about sequences:
- "If I do maintenance now, I'll have energy for creative work next tick"
- "Security is overdue but not critical — I can handle two code tasks first"
- "Pain is rising. Better do system checks now before it cascades into a circuit breaker trip"
The LLM theoretically could have done this. In practice, it never did — the prompts weren't structured for sequential reasoning, and the model was too small for reliable multi-step planning anyway.
But the question remains: can we get sequence-aware planning without going back to LLM inference?
The next step: Monte Carlo Tree Search
MCTS is the answer. It's how AlphaGo plans moves — not by evaluating each move independently, but by simulating sequences of moves and propagating results back through a search tree. The key insight is that decide_fast() is an excellent rollout policy for MCTS.
Here's the architecture:
State
```python
from dataclasses import dataclass

@dataclass
class SimState:
    energy: float        # Agent energy remaining
    pain: float          # System pain level
    cooldowns: dict      # action_id → seconds until available
    last_executed: dict  # action_id → virtual timestamp
    tick: int            # Steps from root
```
Simulation
Each step in the simulation:
- Pick an action from available (non-cooldown) tasks
- Deduct energy proportional to effort
- Set the action's cooldown
- Adjust pain: maintenance/system tasks reduce it, others let it drift up
- Advance time: all cooldowns tick down
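Those five steps fit in one transition function. This is a sketch, not the Genesis code: the energy-per-effort scale, the pain deltas, and the 60-second virtual step are assumptions, and the action fields mirror the ones `decide_fast` uses:

```python
from dataclasses import dataclass

@dataclass
class SimState:  # mirrors the dataclass above
    energy: float
    pain: float
    cooldowns: dict      # action_id → seconds until available
    last_executed: dict  # action_id → virtual tick
    tick: int

MAINTENANCE_LIKE = {"maintenance", "system"}  # assumption: domain labels

def simulate_step(state: SimState, action, tick_seconds: float = 60.0) -> SimState:
    """Apply one virtual action and return the successor state."""
    # Advance time: every existing cooldown ticks down by one virtual step
    cooldowns = {a: max(t - tick_seconds, 0.0) for a, t in state.cooldowns.items()}
    # Set the executed action's cooldown back to its full interval
    cooldowns[action.action_id] = action.interval_minutes * 60.0
    # Maintenance/system work reduces pain; everything else lets it drift up
    if action.domain in MAINTENANCE_LIKE:
        pain = max(state.pain - 0.1, 0.0)
    else:
        pain = min(state.pain + 0.02, 1.0)
    last_executed = {**state.last_executed, action.action_id: state.tick}
    return SimState(
        energy=state.energy - action.effort * 2.0,  # energy cost ∝ effort (assumed scale)
        pain=pain,
        cooldowns=cooldowns,
        last_executed=last_executed,
        tick=state.tick + 1,
    )
```

Because the state is a handful of floats and two small dicts, each transition is cheap enough to run thousands of times per real tick.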
Search
```
Root = current agent state
Repeat 64 times:
  1. SELECT:  walk down the tree using UCB1
              (balance exploitation of best-known vs exploration of untried)
  2. EXPAND:  add a child node for an untried action
  3. ROLLOUT: simulate H steps using decide_fast scoring as the policy
  4. BACKUP:  propagate the terminal reward up the tree
Return: the root action with the most visits
```
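The UCB1 rule in the SELECT step is the standard one: mean reward plus an exploration bonus that shrinks as a child accumulates visits (`c` is the usual exploration constant, around √2):

```python
import math

def ucb1(value_sum: float, visits: int, parent_visits: int, c: float = 1.4) -> float:
    """UCB1 selection score: exploitation term plus exploration bonus."""
    if visits == 0:
        return math.inf  # untried children win selection immediately
    exploitation = value_sum / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration
```

At each tree node, SELECT picks the child maximizing this score; EXPAND fires whenever some child still has `visits == 0`.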
Reward function
At the end of a rollout (H=5 steps):
```python
reward = (
    coverage_score         # Were all task types attended to?
    + urgency_score        # Were critically overdue tasks handled?
    + pain_trajectory      # Did pain decrease over the sequence?
    + energy_efficiency    # Did we avoid depleting energy?
    + persona_alignment    # Did we stay in our work_type lane?
)
```
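One way those five terms could be computed from a rollout's states and actions. This is illustrative only: the field names (`overdue_ratio`), the 3× overdue threshold, and the equal weighting are assumptions, not the Genesis values:

```python
def rollout_reward(states, actions, all_domains, persona_domains) -> float:
    """Terminal reward for an H-step rollout; each term lands roughly in [0, 1]."""
    n = max(len(actions), 1)
    # coverage: fraction of task domains touched during the rollout
    coverage = len({a.domain for a in actions}) / max(len(all_domains), 1)
    # urgency: credit for executing actions that were critically overdue
    urgency = sum(1.0 for a in actions if a.overdue_ratio > 3.0) / n
    # pain trajectory: positive if pain fell over the sequence
    pain_trajectory = states[0].pain - states[-1].pain
    # energy efficiency: reward ending the rollout with reserves
    energy_efficiency = states[-1].energy / 100.0
    # persona alignment: fraction of actions in the agent's preferred domains
    persona_alignment = sum(1.0 for a in actions if a.domain in persona_domains) / n
    return coverage + urgency + pain_trajectory + energy_efficiency + persona_alignment
```

Keeping every term on a comparable scale matters: if one component dominates, the search collapses into a single-objective optimizer and loses the trade-off behavior the reward is meant to capture.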
The key trick: decide_fast as rollout policy
MCTS needs a rollout policy — a cheap function to play out hypothetical futures. decide_fast() is exactly that. It already encodes persona affinity, affect modulation, and urgency weighting. Using it as the rollout policy means every simulated future is persona-consistent, not random.
This is the same pattern that made AlphaGo work: a strong policy network makes rollouts realistic, so the tree search converges quickly.
Performance budget
- 64 iterations × 5-step horizon = 320 state transitions
- Each transition: a few dict operations ≈ 100ns
- Scoring per step: ~1μs (the fast scorer)
- Total: ~5-10ms
That's 100x more than decide_fast() alone, but still 1,500x faster than the LLM path. And unlike the LLM, it actually reasons about sequences.
When to use MCTS vs fast
Not every tick needs sequence planning. The decision of when to think harder is itself a decision:
| Condition | Strategy |
|---|---|
| Pain > 0.6 | MCTS — system stressed, need strategic recovery |
| Energy < 30% | MCTS — budget remaining actions carefully |
| 3+ tasks with overdue > 3× interval | MCTS — competing priorities need sequencing |
| Normal operation | Fast — any reasonable choice is fine |
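The dispatch itself is just a few comparisons. A sketch using the table's thresholds, assuming the context exposes the same fields `decide_fast` reads:

```python
def choose_strategy(ctx) -> str:
    """Effort-scaled dispatch: escalate to MCTS only when the situation warrants it."""
    heavily_overdue = sum(
        1 for a in ctx.ready_actions
        if a.last_executed_ago_min > 3 * a.interval_minutes
    )
    if ctx.system_pain > 0.6:   # system stressed → strategic recovery
        return "mcts"
    if ctx.energy < 30:         # budget remaining actions carefully
        return "mcts"
    if heavily_overdue >= 3:    # competing priorities need sequencing
        return "mcts"
    return "fast"               # any reasonable choice is fine
```

Since this check runs in nanoseconds, the meta-decision adds no measurable overhead to the common case.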
This is effort-scaled decision making. Trivial situations get trivial computation. Complex situations get deeper search. The system thinks harder when it matters — without ever blocking the event loop.
The progression
```
LLM inference:  15,000ms  → mostly fails, blocks the event loop
Fast scoring:   0.1ms     → persona-aware, instant, no GPU
MCTS + fast:    ~5-10ms   → sequence-aware, still no GPU
```
Three layers, each building on the last:

- The scorer (`decide_fast`) encodes domain knowledge as weighted signals
- The tree search (`decide_mcts`) uses the scorer to simulate futures
- The meta-decision chooses which strategy to use based on situation complexity
No LLM in the loop. No HTTP calls. No GPU. Just data structures and arithmetic, encoding the same intuitions that the LLM prompt was trying to approximate — but reliably, deterministically, and 100,000 times faster.
The lesson
The reflex to "use an LLM for it" is strong. LLMs are genuinely magical for tasks that require language understanding, creative generation, and nuanced reasoning. But task selection from a finite list is not one of those tasks. It's a scoring problem. A search problem. A classical AI problem.
When you find yourself writing a prompt that says "pick one from this list based on these criteria" — stop. Write the scoring function. It'll be faster, more reliable, and more debuggable. Save the LLM for problems that actually need it.
Then, if you need sequence reasoning, reach for MCTS. It's been solving planning problems since 2006. It'll solve yours too — in 5 milliseconds instead of 15 seconds.
The OrganicDecider changes are live in Genesis. decide_fast() handles all kernel tick decisions. The MCTS layer (decide_mcts()) is implemented and available for high-stakes decisions. The LLM path (decide()) is preserved for social and innerlife decisions where personality-driven prose generation genuinely matters.