Looped LLMs, Adaptive Compute, and the Case for Local Reasoning
TikTok released Ouro. Together AI launched Parcae. OpenMythos dropped an open-source reproduction of what Claude Mythos is suspected to be under the hood — a Recurrent-Depth Transformer with switchable MLA/GQA attention and sparse MoE feed-forward layers, where the model itself decides how many laps to take through its own weights before emitting a token.
Easy prompt? One pass. Hard prompt? Full existential crisis mode.
No chain-of-thought. No "let me think step by step" theater. The reasoning happens silently in continuous latent space — which is just the mathematical term for pacing in your bedroom at 3 AM until the answer arrives.
We spent five years scaling width. Then depth. Then test-time tokens. The new frontier is sheer stubbornness.
This is exactly what AitherOS has been doing at the system level since March.
The Looped LLM Thesis
The core insight behind looped architectures is simple: not all problems are equally hard, so not all problems should consume equal compute.
A fixed-depth transformer spends the same number of FLOPs whether you ask "what's 2+2" or "prove the Riemann hypothesis." That's wasteful on the easy end and insufficient on the hard end. Recurrent-depth models fix this by letting the model loop through its own layers repeatedly — the network decides when it's "done thinking."
The architectural pattern looks something like:
Input → Embedding → [Transformer Block × N] → Loop Decision Gate
                             ↑__________________________|
                               (if not confident, loop again)
Each loop refines the latent representation without emitting tokens. The model isn't generating chain-of-thought text — it's iterating in the continuous space of its own hidden states. When confidence crosses a threshold (or a compute budget runs out), it exits the loop and generates output.
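If you want to see the shape of that loop in code, here's a minimal PyTorch sketch. The names and the naive sigmoid gate are mine, not any published model's internals; it's just the pattern above made concrete.

```python
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    """One shared block, applied repeatedly until a gate says 'done thinking'."""

    def __init__(self, d_model: int = 512, max_loops: int = 8, threshold: float = 0.9):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.gate = nn.Linear(d_model, 1)   # learned exit signal over the pooled state
        self.max_loops = max_loops
        self.threshold = threshold

    def forward(self, h: torch.Tensor) -> tuple[torch.Tensor, int]:
        steps = 0
        for steps in range(1, self.max_loops + 1):
            h = self.block(h)                                      # refine the latent state
            confidence = torch.sigmoid(self.gate(h.mean(dim=1)))   # (batch, 1) exit score
            if bool((confidence > self.threshold).all()):          # every item is confident
                break                                              # easy prompt: exit early
        return h, steps                                            # hard prompts loop longer
```

An easy input exits after a pass or two; a hard one burns the whole loop budget. Same weights, variable depth.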
This is elegant. It's also exactly one level of abstraction below what we built.
AitherOS: Adaptive Compute at the System Level
In AitherOS, the EffortScaler is the loop decision gate — but instead of operating inside a single model's weights, it operates across the entire inference stack.
Every incoming message flows through IntentEngine, which classifies the problem type, complexity, and whether the question contains traps. This produces an effort level from 1 to 10. The EffortScaler then builds an ExecutionPlan that determines everything: which model to use, how much context to assemble, whether to enable tools, reasoning depth, verification strategies, and token budgets.
| Effort | Model | Latency Budget | What Happens |
|---|---|---|---|
| 1–2 | 3B reflex (CPU) | ~500ms | Direct answer, skip context pipeline |
| 3–4 | 8B orchestrator (GPU) | ~2s | Full context assembly, basic tool access |
| 5–6 | 8B + tools + RLM | ~5s | Reinforcement learning loops, graph reasoning |
| 7–8 | 8B + 14B reasoning | ~10s | Deep think pipeline, MCTS planning, council review |
| 9–10 | Full stack + subagents | ~30s | Agent swarms, multi-turn tool chains, sandbox execution |
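In code, the table above collapses to a small routing function. This is an illustrative sketch: the class and field names are stand-ins, and the real ExecutionPlan carries far more knobs (context assembly, reasoning depth, verification strategies, token budgets).

```python
from dataclasses import dataclass

@dataclass
class ExecutionPlan:
    model: str
    latency_budget_s: float
    enable_tools: bool
    enable_subagents: bool
    enable_verification: bool = False

def build_plan(effort: int, requires_verification: bool = False) -> ExecutionPlan:
    """Map an IntentEngine effort level (1-10) to an execution plan."""
    if effort <= 2:
        plan = ExecutionPlan("reflex-3b-cpu", 0.5, enable_tools=False, enable_subagents=False)
    elif effort <= 4:
        plan = ExecutionPlan("orchestrator-8b", 2.0, enable_tools=True, enable_subagents=False)
    elif effort <= 6:
        plan = ExecutionPlan("orchestrator-8b-rlm", 5.0, enable_tools=True, enable_subagents=False)
    elif effort <= 8:
        plan = ExecutionPlan("reasoning-14b", 10.0, enable_tools=True, enable_subagents=False)
    else:
        plan = ExecutionPlan("full-stack-subagents", 30.0, enable_tools=True, enable_subagents=True)
    plan.enable_verification = requires_verification   # propagated from IntentEngine
    return plan
```

The specific thresholds matter less than the shape: one cheap classification decides how much of the stack wakes up.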
This is the same principle as a recurrent-depth transformer — the system decides how hard to stew on a problem — but with a critical advantage: each "loop" can use a completely different model architecture, tool set, and verification strategy. A looped LLM can only iterate through its own weights. A looped system can escalate from a 3B model to an 8B model to a 14B reasoning model to a council of six simulated perspectives arguing about the answer.
The Verification Gate: When Stubbornness Becomes Strategy
The newest addition to this pipeline is the MCTS-planned verification gate. Here's the problem it solves:
Ask an 8B model "How many r's in strawberry?" and it will confidently say "2." It's wrong. There are 3. This isn't a knowledge failure — the model knows how to spell strawberry. It's a counting failure that happens because autoregressive generation doesn't do computation. The model is predicting the next likely token, and "2" is the most likely continuation in its training distribution.
The fix isn't a bigger model. GPT-4 gets this wrong too. The fix is to stop asking the LLM to compute and instead make it write a program that computes.
When IntentEngine detects a question that requires verification — counting, spelling, math, trick questions with false presuppositions, logic puzzles — it sets requires_verification=True. This propagates through the EffortScaler into the ExecutionPlan, and after the LLM generates its candidate answer, the verification gate fires.
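The detection itself is cheap. A rough sketch of the kind of pattern check involved, with illustrative patterns rather than the production heuristics:

```python
import re

# Illustrative trigger patterns; the production heuristics are broader.
VERIFICATION_PATTERNS = [
    r"\bhow many\b",                                  # counting questions
    r"\bcount\b|\bspell\b",                           # explicit counting / spelling requests
    r"\d\s*[-+*/^]\s*\d",                             # inline arithmetic
    r"\b(older|taller|heavier|faster)\b.*\bthan\b",   # comparison / trick setups
]

def needs_verification(question: str) -> bool:
    q = question.lower()
    return any(re.search(p, q) for p in VERIFICATION_PATTERNS)

# needs_verification("How many r's in strawberry?")   -> True
# needs_verification("Tell me about photosynthesis")  -> False
```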
Here's the actual flow:
User: "How many r's in strawberry?"
1. IntentEngine classifies: trick/counting question
→ effort=7, requires_verification=True
2. EffortScaler builds plan:
→ enable_verification=True
→ verification_max_strategies=3
3. LLM generates candidate: "2" (wrong)
4. Verification gate fires:
a. classify_problem() → {programmatic: 1.0, enumeration: 1.0}
(pure regex, sub-millisecond)
b. MCTS selects strategies: ["programmatic", "enumeration"]
(50 iterations, <2 seconds)
c. Programmatic strategy:
→ LLM writes Python:
count = sum(1 for c in 'strawberry' if c == 'r')
print(f"Final Answer: {count}")
→ CodeValidator checks AST safety
→ Restricted exec() runs it
→ Result: 3
d. Enumeration strategy:
→ LLM spells out: s-t-r-a-w-b-e-r-r-y
→ Counts: r at position 3, 8, 9
→ Result: 3
e. Consensus: both say 3, candidate says 2
→ REPLACE candidate with verified answer
5. User receives: "3" (correct)
The key insight: the LLM doesn't need to be good at counting. It needs to be good at writing a program that counts. And LLMs are very good at writing Python.
MCTS: Letting the System Argue With Itself
MCTS — Monte Carlo Tree Search — is the algorithm that powered AlphaGo. In AitherOS, it serves a different purpose: strategy selection for verification.
The ReasoningVerificationEnv defines 17 verification strategies across 5 tiers:
- Tier 1 (prompt-only): Chain-of-thought, enumeration, adversarial checking, decomposition
- Tier 2 (tool-using): Programmatic (code execution), web search verification
- Tier 3 (multi-perspective): Council debate, devil's advocate, Socratic questioning
- Tier 4 (formal): Constraint satisfaction, proof sketching
- Tier 5 (ensemble): Multi-model voting, calibrated confidence
MCTS explores which combination of strategies to apply. The reward function favors tier diversity, cost efficiency, and breadth of coverage. For a counting question, it quickly converges on programmatic + enumeration. For a logic puzzle with false presuppositions, it might select adversarial checking + decomposition + council debate.
The search is fast — 50 iterations with a 2-second time limit — because the action space is small (selecting from ~17 strategies) and the cost model is simple (each Tier 1 strategy costs 0.1, higher tiers cost more).
If MCTS fails or returns an empty path, the system falls back to the top strategies by heuristic score. If all strategies fail, the candidate answer passes through unchanged. Every failure mode defaults to "do nothing" — the verification gate can only improve answers, never make them worse.
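Here's a compressed, self-contained sketch of that search: UCB1 selection over a small strategy set, random rollouts scored on fit, tier diversity, and cost, and the heuristic fallback when the search comes up empty. The strategy names, costs, and reward weights are illustrative stand-ins, not the real ReasoningVerificationEnv.

```python
import math
import random
import time

STRATEGIES = {                       # name: (tier, cost) -- illustrative subset of the 17
    "chain_of_thought": (1, 0.1), "enumeration":     (1, 0.1),
    "adversarial":      (1, 0.1), "decomposition":   (1, 0.1),
    "programmatic":     (2, 0.3), "web_search":      (2, 0.4),
    "council_debate":   (3, 0.6), "devils_advocate": (3, 0.5),
}
MAX_PICKS = 3

def reward(path, fit):
    """Score a strategy set: problem fit + tier diversity - cost."""
    tiers = {STRATEGIES[s][0] for s in path}
    cost = sum(STRATEGIES[s][1] for s in path)
    return sum(fit.get(s, 0.0) for s in path) + 0.5 * len(tiers) - cost

class Node:
    def __init__(self, path=(), parent=None):
        self.path, self.parent = path, parent
        self.children, self.visits, self.value = {}, 0, 0.0
    def untried(self):
        return [s for s in STRATEGIES if s not in self.path and s not in self.children]
    def terminal(self):
        return len(self.path) >= MAX_PICKS
    def ucb(self, c=1.4):
        return self.value / self.visits + c * math.sqrt(math.log(self.parent.visits) / self.visits)

def mcts_select(fit, iterations=50, time_limit=2.0):
    root, start = Node(), time.monotonic()
    for _ in range(iterations):
        if time.monotonic() - start > time_limit:
            break
        node = root
        # 1. Selection: descend through fully expanded nodes by UCB1.
        while not node.terminal() and not node.untried() and node.children:
            node = max(node.children.values(), key=Node.ucb)
        # 2. Expansion: add one untried strategy as a new child.
        if not node.terminal() and node.untried():
            pick = random.choice(node.untried())
            node.children[pick] = Node(node.path + (pick,), parent=node)
            node = node.children[pick]
        # 3. Rollout: randomly complete the strategy set and score it.
        rollout = list(node.path)
        while len(rollout) < MAX_PICKS:
            rollout.append(random.choice([s for s in STRATEGIES if s not in rollout]))
        score = reward(rollout, fit)
        # 4. Backpropagation.
        while node is not None:
            node.visits, node.value = node.visits + 1, node.value + score
            node = node.parent
    # Final choice: follow the most-visited branch; fall back to heuristic fit scores.
    node = root
    while node.children:
        node = max(node.children.values(), key=lambda n: n.visits)
    if not node.path:
        return sorted(STRATEGIES, key=lambda s: fit.get(s, 0.0), reverse=True)[:2]
    return list(node.path)

# mcts_select({"programmatic": 1.0, "enumeration": 1.0})
# -> a small set dominated by the programmatic and enumeration strategies
```

With a single-digit action space and at most three picks, 50 iterations is enough for the visit counts to settle, which is why the 2-second budget holds.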
The CodeAct Path: LLMs as Program Synthesizers
The programmatic verification strategy is where this gets interesting. Instead of asking the LLM to answer a question directly, we ask it to write code that answers the question. Then we run the code.
# LLM generates this:
words = "strawberry"
count = sum(1 for c in words if c == 'r')
print(f"Final Answer: {count}")
# System executes with restricted builtins:
# Only: print, len, range, str, int, float, sum, sorted,
# enumerate, min, max, abs, round, chr, ord, zip
# No: open, import, eval, exec, __import__, os, sys, subprocess
The CodeValidator performs AST-level safety analysis before execution — checking for forbidden imports, dangerous builtins, infinite loops, and file system access. The execution sandbox restricts builtins to a whitelist of pure functions. No file I/O. No network access. No imports. Just computation.
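A minimal sketch of that validate-then-execute path. Names are illustrative, and the real CodeValidator does more than this (loop bounds, resource limits, timeouts) that the sketch omits.

```python
import ast
import builtins
import contextlib
import io

ALLOWED_BUILTINS = ["print", "len", "range", "str", "int", "float", "sum", "sorted",
                    "enumerate", "min", "max", "abs", "round", "chr", "ord", "zip"]
FORBIDDEN_NAMES = {"open", "eval", "exec", "__import__", "getattr", "globals", "locals"}

def validate(code: str) -> bool:
    """AST-level check: no imports, no forbidden names, no dunder attribute access."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            return False
        if isinstance(node, ast.Name) and node.id in FORBIDDEN_NAMES:
            return False
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            return False
    return True

def run_sandboxed(code: str) -> str:
    """Validate, then execute with a whitelist of pure builtins. Returns captured stdout."""
    if not validate(code):
        return ""                                   # fail closed: contribute nothing
    env = {"__builtins__": {name: getattr(builtins, name) for name in ALLOWED_BUILTINS}}
    out = io.StringIO()
    with contextlib.redirect_stdout(out):
        exec(code, env)                             # no imports, no file I/O, no network
    return out.getvalue().strip()

# run_sandboxed("count = sum(1 for c in 'strawberry' if c == 'r')\n"
#               "print(f'Final Answer: {count}')")
# -> "Final Answer: 3"
```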
This is the same principle behind CodeAct-style agents and the broader "code-as-reasoning" trend in AI research: programs are a more reliable reasoning medium than natural language. A program either produces the right answer or throws an error. It doesn't hallucinate.
Consensus and Council: When Strategies Disagree
Sometimes verification strategies disagree with each other. The programmatic strategy says "3", the enumeration strategy says "3", but the adversarial strategy says "well actually, the question asked about the letter R not r, so the answer is 0."
The consensus protocol handles this:
- Unanimous agreement with candidate: Verification confirms the answer. No change.
- Majority disagrees with candidate (2+ strategies): Replace with the majority answer.
- No clear majority: Escalate to Council arbitration.
The Council is a multi-perspective review system with six personas — including PRAGMATIST, who serves as the tie-breaker. The PRAGMATIST receives all candidate answers, all verification results, and the original question, then makes a final judgment. If even the Council fails (timeout, import error, ambiguous result), the original candidate passes through unchanged.
The cost ceiling is 5 LLM calls worst case: 3 strategy executions + 1 council arbitration + margin. Typical case is 2–3 calls. Total latency budget: 15 seconds (configurable per plan).
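As a sketch, the consensus step is a short function. The Council call here is a labeled stand-in; in the real pipeline that's the PRAGMATIST arbitration described above.

```python
from collections import Counter

def resolve(candidate: str, verified: list[str], question: str) -> str:
    """Apply the consensus rules; `verified` holds one answer per successful strategy."""
    if not verified:
        return candidate                       # every strategy failed: do nothing
    top_answer, top_votes = Counter(verified).most_common(1)[0]
    if top_answer == candidate:
        return candidate                       # verification confirms the candidate
    if top_votes >= 2:
        return top_answer                      # 2+ strategies agree against the candidate
    try:
        return council_arbitrate(question, candidate, verified)   # PRAGMATIST breaks the tie
    except Exception:
        return candidate                       # Council failure: pass the original through

def council_arbitrate(question: str, candidate: str, verified: list[str]) -> str:
    raise NotImplementedError("stand-in for the Council / PRAGMATIST arbitration call")

# resolve("2", ["3", "3"], "How many r's in strawberry?")  -> "3"  (majority replaces)
# resolve("3", ["3", "3"], "How many r's in strawberry?")  -> "3"  (confirmed, unchanged)
```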
Zero Overhead for Normal Questions
This entire verification pipeline fires only when enable_verification=True in the ExecutionPlan. For normal questions — "What time is it?", "Tell me about photosynthesis", "Write a Python function" — the check is a single getattr() call that returns False. No imports loaded. No computation performed. No latency added.
The gate is:
if (
    content
    and _chat_plan
    and getattr(_chat_plan, "enable_verification", False)
):
    # ... verification pipeline ...
One boolean check. That's the total cost for the 95% of queries that don't need verification.
The Deeper Point: Own Your Reasoning Stack
Here's where looped LLMs and local AI converge into a single argument.
Claude Mythos, Ouro, Parcae — these are fascinating architectures. A model that can dynamically allocate compute to hard problems is strictly better than one that can't. But when that model runs in someone else's cloud, the loop decision is made by their infrastructure, optimized for their cost function, constrained by their safety filters, and shaped by their alignment choices.
When the reasoning stack runs on your hardware, you control every part of the loop:
- How hard to think: Your EffortScaler, your thresholds, your cost/quality tradeoff
- What to verify: Your IntentEngine, your trick detection heuristics, your domain knowledge
- How to verify: Your strategy selection, your code execution sandbox, your council personas
- What model to use: Your vLLM instance, your fine-tuned weights, your VRAM allocation
- What to remember: Your memory graph, your session history, your context pipeline
AitherOS routes every LLM call through MicroScheduler on port 8150. Every token is generated on a local RTX 5090. The EffortScaler, the verification gate, the MCTS planner, the CodeAct sandbox, the Council — all running on hardware you own, optimizing for objectives you define, with reasoning traces you can inspect.
This isn't an ideological position. It's an engineering one. A reasoning system you can't inspect, modify, or override is a reasoning system you can't trust. And a reasoning system that runs on someone else's profit-maximizing infrastructure is a reasoning system aligned with someone else's goals.
What's Next
The verification gate is Phase 1 of a three-phase architecture:
Phase 1 (shipped): Post-generation verification. MCTS selects strategies. CodeAct writes Python for ground truth. Consensus replaces bad answers.
Phase 2 (planned): Pre-generation MCTS planning. Before the LLM even generates a response, MCTS plans a multi-step approach — which tools to call, which models to use, what context to assemble. The PlanningEnv generates tool orchestration plans. CodeActEngine writes full programs that call MCP tools.
Phase 3 (future): Expedition-level orchestration. Complex problems spawn micro-expeditions — loops of loops. Context accumulates across phases (RAW to DISTILLED to INDEXED to CRYSTALLIZED). Multiple reasoning strategies (DIJKSTRA for shortest-path analysis, EULER for constraint satisfaction, NEUMANN for self-referential problems) operate in an inner loop while the outer loop manages goal tracking and resource allocation.
The endgame is a system that doesn't just decide how hard to think — it decides how to think. Not by routing through bigger models, but by composing reasoning strategies the way a programmer composes functions.
The looped LLM researchers are right about the core insight: adaptive compute is the future. But the loop doesn't have to live inside a single model's weights. It can live in the system that orchestrates the models. And when it does, you can own it.
AitherOS is an open infrastructure project. The verification pipeline, EffortScaler, MCTS planner, and all 196 microservices run on a single machine with an RTX 5090. No cloud dependencies. No API keys for core reasoning. The system thinks as hard as you tell it to, on hardware you control.