Unified MCTS: One Algorithm to Plan Them All
We had four Monte Carlo Tree Searches. Four separate implementations of the same algorithm, scattered across the codebase, each solving its own flavor of "given a bunch of options, find the best sequence of actions."
MCTSPlanner for task decomposition. MCTSToolRouter for tool selection and intent routing. NarrativeMCTS for story branching in Saga. And the internal MCTS inside HierarchicalPlanner for dependency-aware scheduling. Four copies of SELECT-EXPAND-SIMULATE-BACKPROPAGATE, each with subtly different node structures, reward calculations, and termination conditions.
This is the story of how we replaced all four with a single unified engine — and in doing so, accidentally solved a much harder problem: making every execution mode in the system composable via tree search.
The Duplication Tax
The four engines had diverged enough that bugs fixed in one didn't propagate to the others. MCTSPlanner had gotten a better UCT implementation with proper exploration weight tuning. MCTSToolRouter had developed an online learning integration with ConsumptionTracker that improved tool selection over time. NarrativeMCTS had a beautiful multi-dimensional scoring system with coherence, tension, and prose quality metrics. HierarchicalPlanner's internal MCTS had dependency-aware action generation.
All good ideas. None shared.
Every time we wanted to improve the core algorithm — add world model predictions, tune the exploration/exploitation trade-off, add time budgeting — we had to do it four times. And inevitably, we'd do it in one place, verify it worked, and forget about the other three until something broke.
The cognitive overhead of maintaining four MCTS engines was worse than the computational overhead of running them.
The Protocol
The insight was simple: MCTS doesn't care what the actions mean. The algorithm is always the same. What changes between domains is:
- What state looks like — a partial plan, a set of selected tools, a story node, or an expedition schedule
- What actions are available — add a planning step, include a tool, branch the narrative, or dispatch a task
- How to evaluate quality — plan coverage score, tool relevance, narrative coherence, or task completion rate
- How to advance state — step the environment forward by applying an action
These four concerns, plus a clone() method for copying an environment, map directly onto a protocol:
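from typing import Any, List, Protocol, Tuple, runtime_checkable

@runtime_checkable
class MCTSEnvironment(Protocol):
    def get_state_hash(self) -> int: ...                          # hash identifying the current state
    def get_actions(self) -> List[Any]: ...                       # actions available from the current state
    def step(self, action: Any) -> Tuple[Any, float, bool]: ...   # apply an action, advancing the state
    def evaluate(self) -> float: ...                              # quality score for the current state
    def clone(self) -> "MCTSEnvironment": ...                     # independent copy of the environment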
Five methods. That's the entire interface. Any domain that implements these five methods gets the full power of MCTS: UCT-based tree selection, world model predictions, configurable exploration, time budgeting, confidence estimation, and FluxEmitter observability — all for free.
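To make the interface concrete, here is a minimal sketch of a toy environment that satisfies it. CountdownEnv is hypothetical and not part of the codebase, and reading the step() return tuple as (next_state, reward, done) is an assumption, since the post only gives the type signature.

```python
from typing import Any, List, Protocol, Tuple, runtime_checkable


@runtime_checkable
class MCTSEnvironment(Protocol):
    def get_state_hash(self) -> int: ...
    def get_actions(self) -> List[Any]: ...
    def step(self, action: Any) -> Tuple[Any, float, bool]: ...
    def evaluate(self) -> float: ...
    def clone(self) -> "MCTSEnvironment": ...


class CountdownEnv:
    """Hypothetical toy domain: subtract 1 or 2 until the counter hits zero."""

    def __init__(self, remaining: int = 5) -> None:
        self.remaining = remaining

    def get_state_hash(self) -> int:
        return hash(self.remaining)

    def get_actions(self) -> List[Any]:
        # Only moves that keep the counter non-negative are legal.
        return [n for n in (1, 2) if n <= self.remaining]

    def step(self, action: Any) -> Tuple[Any, float, bool]:
        # Assumed tuple layout: (next_state, reward, done).
        self.remaining -= action
        done = self.remaining == 0
        return self.remaining, self.evaluate(), done

    def evaluate(self) -> float:
        # Assumed reward shape: 1.0 at the goal, shrinking with distance.
        return 1.0 / (1.0 + self.remaining)

    def clone(self) -> "CountdownEnv":
        return CountdownEnv(self.remaining)


env = CountdownEnv(3)
# runtime_checkable makes this a structural check: no inheritance needed.
assert isinstance(env, MCTSEnvironment)
```

Note that CountdownEnv never inherits from the protocol. The structural check only confirms the five method names exist, not that their signatures match.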
The unified engine itself is 580 lines. No domain-specific code. Pure algorithm.
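For readers who haven't seen it, the selection rule at the heart of that algorithm fits in a few lines. This is a sketch of the textbook UCT formula, not the engine's actual code. The function names and the (value_sum, visits) tuple layout are assumptions made for illustration.

```python
import math
from typing import Dict, Tuple


def uct_score(child_value_sum: float, child_visits: int,
              parent_visits: int, exploration_weight: float = 1.41) -> float:
    """Textbook UCT: mean value plus an exploration bonus."""
    if child_visits == 0:
        return math.inf  # unvisited children are always tried first
    exploit = child_value_sum / child_visits
    explore = exploration_weight * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore


def select_child(children: Dict[str, Tuple[float, int]], parent_visits: int) -> str:
    """children maps an action label to (value_sum, visits); returns the best label."""
    return max(children, key=lambda a: uct_score(*children[a], parent_visits))
```

A higher exploration_weight pushes the search toward rarely visited children. A lower one makes it exploit the current best estimate sooner.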
The Environments
Each domain becomes an adapter:
PlanningEnv
Wraps the task decomposition problem. State is a partial plan (list of steps taken so far). Actions are available next steps generated by the ActionGenerator. Reward is the PlanEvaluator's composite score for plan coverage, feasibility, and risk.
# Run inside an async function: search() is awaited.
env = PlanningEnv(problem)
engine = UnifiedMCTS(MCTSConfig(iterations=200))
result = await engine.search(env)
# result.best_action_path = ordered plan steps
ToolSelectionEnv
Turns tool curation into a selection problem. State is which tools have been selected so far. Actions are remaining tools. Reward combines relevance (keyword overlap with the task), diversity (category spread), and cost awareness (cheaper tool sets score higher). Search terminates when the target toolkit size is reached.
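A reward with those three components might look roughly like the sketch below. The Tool dataclass, the weights, and the normalization are all assumptions. The real scoring in ToolSelectionEnv is not shown in this post.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Tool:  # hypothetical shape, not the project's tool type
    name: str
    keywords: FrozenSet[str]
    category: str
    cost: float  # assumed to be normalized to [0, 1]


def toolkit_reward(task: str, selected: List[Tool],
                   weights: Tuple[float, float, float] = (0.6, 0.25, 0.15)) -> float:
    """Combine relevance, diversity, and cost awareness into one score."""
    if not selected:
        return 0.0
    task_words = set(task.lower().split())
    # Relevance: mean fraction of each tool's keywords that appear in the task.
    relevance = sum(len(t.keywords & task_words) / max(len(t.keywords), 1)
                    for t in selected) / len(selected)
    # Diversity: distinct categories per selected tool.
    diversity = len({t.category for t in selected}) / len(selected)
    # Cost awareness: cheaper sets score higher.
    thrift = 1.0 - sum(t.cost for t in selected) / len(selected)
    w_rel, w_div, w_cost = weights
    return w_rel * relevance + w_div * diversity + w_cost * thrift
```

Because every component stays in [0, 1] and the weights sum to 1, the reward stays in [0, 1] as well, which keeps it comparable across toolkit sizes.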
WorkflowEnv — The Composition Layer
This is where it gets interesting. WorkflowEnv doesn't solve a single domain problem. It makes every AitherOS execution mode into an MCTS action:
Actions (modes as choices):
- forge_dispatch(agent, task) — Single agent with ReAct loop
- swarm_execute(problem, mode) — 11-agent coding pipeline
- council_deliberate(topic, agents) — Multi-agent review
- rlm_explore(query, tools) — Context exploration
- sase_reason(query, depth) — Deep reasoning cycle
- code_act(task) — Dynamic code generation
- direct_llm(prompt, model) — Raw LLM call (cheapest)
Each mode has a cost/quality/latency profile. MCTS explores combinations: "what if we forge-dispatch demiurge to write the code, then council hydra+athena to review it, then swarm the test suite?" The search finds the workflow composition that maximizes quality within the effort and cost budget.
This is the consolidation layer. Before UnifiedMCTS, deciding whether to use forge, swarm, or a council was a hardcoded decision based on effort level. Now it's a search problem. The system explores hundreds of workflow combinations in under a second and picks the one with the best expected outcome.
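To illustrate what "maximizes quality within the budget" could mean in code, here is a minimal sketch of scoring a mode sequence against a cost budget. The profile numbers and the rule for combining quality across stages are invented for the example. The real profiles and evaluation live in WorkflowEnv and are not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ModeProfile:
    cost: float      # assumed relative cost units
    quality: float   # assumed expected quality contribution in [0, 1]
    latency: float   # assumed seconds


# Illustrative numbers only; the actual profiles are not given in this post.
PROFILES: Dict[str, ModeProfile] = {
    "direct_llm": ModeProfile(cost=1, quality=0.40, latency=2),
    "forge_dispatch": ModeProfile(cost=4, quality=0.65, latency=20),
    "council_deliberate": ModeProfile(cost=6, quality=0.70, latency=30),
    "swarm_execute": ModeProfile(cost=15, quality=0.85, latency=120),
}


def score_workflow(modes: List[str], cost_budget: float) -> float:
    """Score a sequence of modes; sequences over budget score 0.0."""
    total_cost = sum(PROFILES[m].cost for m in modes)
    if total_cost > cost_budget:
        return 0.0
    # Assumed combination rule: each stage closes part of the remaining shortfall.
    shortfall = 1.0
    for m in modes:
        shortfall *= 1.0 - PROFILES[m].quality
    return 1.0 - shortfall
```

With a budget of 12, forge followed by council outscores either mode alone, while a lone swarm run is rejected as over budget. That is the kind of trade-off the search is navigating.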
ExpeditionEnv
Makes expedition task scheduling MCTS-plannable. State tracks which tasks are complete, what phase we're in, and resource consumption. Actions are (task_id, execution_method) pairs: each task can be dispatched via forge or swarm, with different agents. The search looks for a schedule that respects dependencies while trading off cost against throughput.
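Dependency-aware action generation is the part that keeps the tree small. A rough sketch of the idea follows. It assumes dependencies are stored as a dict of sets and leaves out agent selection for brevity, and ready_actions is a hypothetical helper, not the ExpeditionEnv API.

```python
from typing import Dict, List, Set, Tuple

METHODS = ("forge", "swarm")


def ready_actions(deps: Dict[str, Set[str]],
                  completed: Set[str]) -> List[Tuple[str, str]]:
    """Return (task_id, execution_method) pairs for tasks whose dependencies are done."""
    ready = [t for t in sorted(deps)
             if t not in completed and deps[t] <= completed]
    return [(t, m) for t in ready for m in METHODS]


deps = {"schema": set(), "api": {"schema"}, "tests": {"api", "schema"}}
first_moves = ready_actions(deps, completed=set())
```

Only tasks whose prerequisites are all complete become actions, so the search never expands a schedule that violates the dependency graph.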
NarrativeEnv
Bridges into Saga's story branching. Narrative search requires LLM-generated candidates (you can't just "step" a story forward deterministically), so this adapter takes pre-generated branch texts and uses MCTS to score and select among them along six heuristic quality dimensions: coherence, character consistency, tension, player agency, novelty, and prose quality.
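The scoring side of this can be pictured as a weighted average over those six dimensions. The sketch below covers only that piece, the kind of number an evaluate() call might return, and uses equal weights as a placeholder. The dimension keys and function names are assumptions, and how Saga actually produces the per-dimension scores is not covered here.

```python
from typing import Dict, Optional

DIMENSIONS = ("coherence", "character_consistency", "tension",
              "player_agency", "novelty", "prose_quality")


def narrative_score(scores: Dict[str, float],
                    weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted mean over the six dimensions; missing scores count as 0.0."""
    weights = weights or {d: 1.0 for d in DIMENSIONS}
    total = sum(weights[d] for d in DIMENSIONS)
    return sum(weights[d] * scores.get(d, 0.0) for d in DIMENSIONS) / total


def pick_branch(candidates: Dict[str, Dict[str, float]]) -> str:
    """Return the label of the highest-scoring pre-generated branch."""
    return max(candidates, key=lambda b: narrative_score(candidates[b]))
```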
ArcGameEnv
A grid-based environment for testing. Also positioned for ARC-AGI-3, where MCTS needs to explore transformations on 2D grids to find the pattern that maps input to output.
The SASE Bridge
UnifiedMCTS doesn't float in isolation. It's wired into the existing cognitive stack via SASEMCTSBridge:
DeepReasoningHelper.classify_reasoning_depth()
  -> CriticalityGate.assess()          (SASE framework)
    -> SASEMCTSBridge.reason()         (this bridge)
      -> UnifiedMCTS.search(env)       (core algorithm)
        -> WorkflowEnv / PlanningEnv   (domain adapter)
The SASE phases (Situation, Analysis, Synthesis, Execution) map onto the search:
- Situation = Observe environment state, gather available actions
- Analysis + Synthesis = MCTS search (explore the action space, find the best plan)
- Execution = Dispatch the chosen workflow via AgentForge / Swarm / direct
The bridge also manages a WorldModel that persists across search sessions. Transition probabilities and state value estimates from previous searches inform future ones — the system learns which action sequences tend to produce good outcomes.
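One simple way to picture such a world model is as a pair of running tallies keyed by state hash. The class below is a sketch under that assumption, not the actual WorldModel. The names, the default prior of 0.5, and the in-memory storage are all illustrative.

```python
from collections import defaultdict
from typing import Dict, Hashable, Tuple


class WorldModelSketch:
    """Hypothetical tally-based model: average state values and transition counts."""

    def __init__(self) -> None:
        self.value_sum: Dict[int, float] = defaultdict(float)
        self.value_count: Dict[int, int] = defaultdict(int)
        self.transitions: Dict[Tuple[int, Hashable], Dict[int, int]] = defaultdict(
            lambda: defaultdict(int))

    def record(self, state: int, action: Hashable,
               next_state: int, value: float) -> None:
        # Called after each observed step during a search.
        self.transitions[(state, action)][next_state] += 1
        self.value_sum[next_state] += value
        self.value_count[next_state] += 1

    def state_value(self, state: int, default: float = 0.5) -> float:
        # Mean observed value, or a neutral prior for unseen states.
        n = self.value_count.get(state, 0)
        return self.value_sum[state] / n if n else default

    def transition_prob(self, state: int, action: Hashable, next_state: int) -> float:
        # Empirical frequency of landing in next_state after (state, action).
        counts = self.transitions.get((state, action))
        if not counts:
            return 0.0
        return counts.get(next_state, 0) / sum(counts.values())
```

Keeping one instance alive across searches is what lets later sessions start from earlier estimates instead of a blank slate.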
The Integration Map
UnifiedMCTS now wires into seven integration points:
AgentForge._auto_route_mcts()      → WorkflowEnv       (agent chain composition)
AgentForge._build_tool_registry()  → ToolSelectionEnv  (MCTS tool curation)
SixPillars._run_reasoning_phase()  → SASEMCTSBridge    (pre-planning for P3)
ExpeditionManager._execute_task()  → ExpeditionEnv     (task method selection)
SwarmCodingEngine.auto_select()    → WorkflowEnv       (mode selection)
HierarchicalPlanner.decompose()    → PlanningEnv       (task decomposition)
EffortScaler.build_plan()          → WorkflowEnv       (orchestration mode)
Each integration is gated by effort level and wrapped in try/except with graceful fallback to the previous behavior. If UnifiedMCTS fails or isn't available, nothing breaks — the system degrades to the pre-existing heuristic path.
This is critical. MCTS is an optimization, not a dependency. The system ran fine before it, and it will run fine if any individual integration point fails.
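The gating pattern itself is small. Here is a minimal sketch of it. The function name, the effort threshold of 5, and the callable signatures are assumptions for illustration rather than the code at any of the integration points.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("mcts_fallback")


def route_with_fallback(task: str, effort: int,
                        mcts_route: Optional[Callable[[str], str]],
                        heuristic_route: Callable[[str], str],
                        min_effort: int = 5) -> str:
    """Use the MCTS path only above an effort threshold, and never let it raise."""
    # Gate: skip MCTS when it is unavailable or the task is low effort.
    if mcts_route is None or effort < min_effort:
        return heuristic_route(task)
    try:
        return mcts_route(task)
    except Exception:
        # Broad on purpose: the optimization must never break dispatch.
        logger.warning("MCTS routing failed; using heuristic path", exc_info=True)
        return heuristic_route(task)
```

Passing None for mcts_route models the "isn't available" case, for example when the import of the engine fails at startup.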
Observability
Every MCTS search emits two FluxEmitter events:
MCTS_SEARCH_STARTED carries the search configuration:
{
  "domain": "WorkflowEnv",
  "max_iterations": 100,
  "time_limit": 0.8,
  "max_depth": 4,
  "exploration_weight": 1.41
}
MCTS_SEARCH_COMPLETE carries the results:
{
  "domain": "WorkflowEnv",
  "iterations": 87,
  "best_value": 0.823,
  "confidence": 0.91,
  "elapsed_seconds": 0.34,
  "tree_depth": 3,
  "best_action": "forge(demiurge: Implement OAuth2)",
  "path_length": 3
}
The complete event also posts to Strata for long-term analysis. ConsumptionTracker and SessionLearner already subscribe to Flux events, so they automatically ingest MCTS search data into the learning loop — closing the feedback cycle without any explicit wiring.
Performance
The unified engine with six domain adapters is covered by 104+ tests. The original four MCTS implementations had 97 tests between them. Combined: 201+ passing tests covering every domain, every integration point, and edge cases like empty action spaces, time budget exhaustion, and world model persistence.
Search performance on the simple grid environment (for benchmarking): sub-millisecond for 10 iterations. For realistic workflow composition with 7 modes and 13 agents, 100 iterations completes in under 800ms — well within the latency budget for agent dispatch.
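Staying inside a latency budget comes down to checking the clock on every iteration. A sketch of such a loop is below. It assumes an iteration callback and a wall-clock limit in seconds. run_budgeted is a hypothetical helper, not the engine's search loop.

```python
import time
from typing import Callable, Tuple


def run_budgeted(iterate: Callable[[], None],
                 max_iterations: int, time_limit: float) -> Tuple[int, float]:
    """Run iterate() until either budget is exhausted; return (count, elapsed seconds)."""
    start = time.monotonic()  # monotonic clock: immune to system clock changes
    done = 0
    while done < max_iterations and time.monotonic() - start < time_limit:
        iterate()
        done += 1
    return done, time.monotonic() - start
```

Whichever limit is hit first ends the search, so a slow evaluate() reduces the iteration count rather than blowing the latency budget. A single long iteration can still overshoot, since the clock is only checked between iterations.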
The key performance insight: MCTS with heuristic evaluation (no LLM calls in the loop) is extremely fast. The real cost of the old MCTSToolRouter was never the tree search itself. It was maintaining a separate implementation with its own node structures, configuration, and learning hooks. The unified engine removes that cost.
What's Next
The architecture is designed for extensibility. Any new domain that implements the five-method MCTSEnvironment protocol immediately gets:
- Full MCTS search with UCT-based exploration
- World model predictions (state value estimates, transition probabilities)
- Configurable time and iteration budgets
- Confidence estimation from visit distribution entropy
- FluxEmitter observability and Strata ingestion
- SASE integration via the bridge
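As one example of what those shared features involve, confidence from visit distribution entropy can be sketched as one minus the normalized Shannon entropy of the root's child visit counts. The exact formula the engine uses is not given in this post, so treat the normalization below as an assumption.

```python
import math
from typing import Sequence


def visit_confidence(visits: Sequence[int]) -> float:
    """1.0 when all visits go to one child, 0.0 when they are spread evenly."""
    total = sum(visits)
    if total == 0:
        return 0.0  # nothing explored yet, so no confidence
    if len(visits) < 2:
        return 1.0  # a single option leaves nothing to be unsure about
    probs = [v / total for v in visits if v > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    # Normalize by the maximum possible entropy for this many children.
    return 1.0 - entropy / math.log(len(visits))
```

A search that keeps returning to the same child produces a peaked distribution and a high score. An evenly spread distribution signals that the search has not separated the options.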
The next test: ARC-AGI-3. ArcGameEnv already wraps 2D grid transformations as MCTS actions. The challenge is making the evaluate() heuristic good enough to guide search toward correct transformations without brute-forcing the entire space. If the world model can learn grid transformation patterns from training examples, MCTS becomes a powerful reasoning engine for novel puzzles.
Beyond ARC, the WorkflowEnv composition layer opens up automated pipeline optimization. Instead of hand-tuning which tasks get forge vs swarm vs council, the system can explore thousands of configurations offline and learn the optimal dispatch strategy for each task category. Connect that to ConsumptionTracker's outcome data, and you have a self-optimizing orchestration layer.
Four engines became one. One protocol serves every domain. The algorithm doesn't care whether it's planning a code review or branching a story — it just searches for the best path through the tree. As it should be.