Your AI Guesses Which Tools to Use. Ours Plays Chess.
There is a dirty secret in the AI tools ecosystem that nobody wants to talk about.
When your agent has access to fifty tools and the user says "find me the latest research on transformer architectures and save a summary to my notes," what happens behind the scenes is shockingly dumb. The system takes all fifty tool definitions — search, file I/O, image generation, email, database queries, encryption, deployment, social media posting, audio transcription, everything — serializes them into a massive JSON blob, stuffs them into the context window, and asks the language model: "hey, which of these do you want to call?"
The model picks one. Maybe the right one. Maybe a plausible-sounding wrong one. Maybe it hallucinates a tool that doesn't exist.
This is how tool use works in every major agent framework today. It is the equivalent of handing someone a phone book and asking them to find a plumber, except the phone book also contains entries for dentists, florists, and aircraft mechanics, and some of the phone numbers are made up.
We decided to fix this.
The Problem Is Combinatorial
The naive approach to tool selection fails in three ways:
1. Context window pollution. Fifty tool schemas consume thousands of tokens. Every token spent describing encrypt_data on a research task is a token the model can't use for actual reasoning. Studies consistently show that as the number of function schemas increases, tool-calling accuracy decreases — even for frontier models. The model gets confused by options it will never need.
2. No lookahead. When a task requires a sequence of tools — search, then fetch a webpage, then extract text, then summarize, then save to file — the model picks tools one at a time. It has no ability to reason about which combination of tools will produce the best end result. It optimizes greedily at each step.
3. No learning. If the model picks the wrong tool, nothing changes for next time. There is no feedback loop. The same mistake pattern repeats across every session, every user, every task.
These aren't edge cases. They are the normal operating mode of every tool-using AI agent shipped in 2025.
Enter Monte Carlo Tree Search
Monte Carlo Tree Search (MCTS) is the algorithm that made AlphaGo possible. It's how a computer learned to beat the world champion at a game with more possible positions than atoms in the universe.
The core idea is elegant: when the decision space is too large to evaluate exhaustively, you simulate many possible paths, keep track of which paths lead to good outcomes, and use that information to make better decisions. The algorithm balances two competing pressures:
- Exploitation — keep doing what's worked before
- Exploration — occasionally try something new to discover better options
This balance is maintained by a formula called UCT (Upper Confidence Bound for Trees):

UCT(i) = Q_i / N_i + c · sqrt(ln(N_p) / N_i)

Where Q_i is the accumulated reward, N_i is the visit count for a candidate, N_p is the parent's visit count, and c is the exploration constant. In plain English: the first term picks winners — tools that have delivered results keep getting picked. The second term gives underdogs a shot — tools that haven't been tried much get a bonus so the system doesn't get stuck in a rut.
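In code, the UCT term is a few lines. This is a minimal sketch; the function name and the default exploration constant (c ≈ √2, a common choice in the MCTS literature) are illustrative, not taken from our codebase:

```python
import math

def uct_score(total_reward, visits, parent_visits, c=1.414):
    """UCT value for one candidate. Unvisited candidates get infinity
    so each one is tried at least once before any is repeated."""
    if visits == 0:
        return float("inf")
    exploitation = total_reward / visits  # average reward so far
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration

# A proven performer vs. a barely-tried underdog (made-up numbers):
veteran = uct_score(total_reward=60, visits=80, parent_visits=150)
underdog = uct_score(total_reward=1.5, visits=2, parent_visits=150)
```

With these numbers the underdog actually outscores the veteran, because the exploration bonus dominates at low visit counts. That is the mechanism that keeps the search from getting stuck in a rut.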
If you squint, this is exactly the tool selection problem. We have a large space of possible tool subsets. We can't evaluate them all. We need to balance using tools that have worked well historically with occasionally trying tools we haven't used much. And we need to do it in under 500 milliseconds.
How We Apply MCTS to Tool Selection
Here's the architecture. It runs before the LLM ever sees a tool schema.
The Search Tree
The root node represents an empty tool selection. Each level of the tree represents adding one tool to the set. A leaf node represents a complete toolkit — typically 3 to 8 tools curated specifically for this task.
[empty]
/ | \
[search] [file] [code]
/ \ |
[search, [search, [file,
fetch] save] edit]
|
[search, fetch,
summarize]
Each path through the tree represents a different tool combination. MCTS explores many paths simultaneously, evaluating each one against the task.
The Four Phases
On every iteration (we run 100-150 in under 500ms):
1. Select. Walk down the tree following the UCT formula. This naturally gravitates toward promising tool combinations while occasionally branching into unexplored territory.
2. Expand. At a leaf, add a new child by including one more tool from the candidate pool. The tool is removed from the untried set so the same combination isn't explored twice.
3. Simulate. From the expanded node, randomly complete the toolkit to the target size. This gives us a full tool set to evaluate without exhaustive search.
4. Backpropagate. Score the simulated toolkit and propagate the reward back up to the root. Every node along the path updates its average reward and visit count.
After all iterations, we extract the best toolkit by following the most-visited children from root to leaf. Most-visited — not highest-average — because visit count is a more robust selection criterion. A tool that was selected 80 times out of 150 iterations is a stronger signal than one that scored perfectly but was only visited twice.
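The four phases and the final most-visited extraction can be sketched end to end. This is an illustrative implementation, not our production code; the `Node` class, function names, target size, and time-budget handling are all assumptions:

```python
import math
import random
import time

class Node:
    def __init__(self, toolkit, untried, parent=None):
        self.toolkit = toolkit        # tools chosen so far on this path
        self.untried = list(untried)  # candidates not yet expanded here
        self.parent = parent
        self.children = []
        self.visits = 0
        self.reward = 0.0

def uct(child, parent, c=1.414):
    if child.visits == 0:
        return float("inf")
    return (child.reward / child.visits
            + c * math.sqrt(math.log(parent.visits) / child.visits))

def mcts_select_tools(candidates, score, target_size=5,
                      budget_ms=500, max_iters=150):
    root = Node(toolkit=[], untried=candidates)
    deadline = time.monotonic() + budget_ms / 1000
    for _ in range(max_iters):
        if time.monotonic() > deadline:
            break  # hard time limit, not an aspirational target
        # 1. Select: walk down by UCT until a node with untried tools.
        node = root
        while not node.untried and node.children:
            node = max(node.children, key=lambda ch: uct(ch, node))
        # 2. Expand: add one more tool; popping from `untried` ensures
        #    the same child combination isn't expanded twice.
        if node.untried and len(node.toolkit) < target_size:
            tool = node.untried.pop()
            toolkit = node.toolkit + [tool]
            child = Node(toolkit,
                         [t for t in candidates if t not in toolkit],
                         parent=node)
            node.children.append(child)
            node = child
        # 3. Simulate: randomly complete the toolkit to the target size.
        pool = [t for t in candidates if t not in node.toolkit]
        fill = random.sample(pool, min(target_size - len(node.toolkit),
                                       len(pool)))
        reward = score(node.toolkit + fill)
        # 4. Backpropagate: update stats on every node up to the root.
        while node is not None:
            node.visits += 1
            node.reward += reward
            node = node.parent
    # Extract: follow the most-visited (not highest-average) children.
    best, node = [], root
    while node.children:
        node = max(node.children, key=lambda ch: ch.visits)
        best = node.toolkit
    return best
```

Passing in `score` as a callable keeps the search generic: the same loop works for tool selection and agent routing, with only the reward function swapped out.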
The Scoring Function
Each simulated toolkit is scored across five dimensions:
| Dimension | Weight | What it measures |
|---|---|---|
| Relevance | 35% | Keyword and semantic overlap between the task and each tool's description |
| Coverage | 25% | Do the selected tools cover all aspects of the task? A task that involves search AND file operations should have tools for both. |
| Efficiency | 20% | Is the set small enough? The sweet spot is 3-8 tools. More than that and LLM accuracy degrades. Fewer is fine if they're relevant. |
| Historical success | 10% | Have these tools succeeded when selected for similar tasks in the past? |
| Complementarity | 10% | Are the tools diverse? Three search tools is worse than one search tool, one file tool, and one analysis tool. |
The total score is a weighted sum, normalized to [0, 1]. This becomes the reward signal for MCTS backpropagation.
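The weighted sum itself is trivial. A sketch using the weights from the table above; the per-dimension scores fed in here are made-up inputs, not real measurements:

```python
# Weights from the scoring table; they sum to 1, so a weighted sum of
# per-dimension scores in [0, 1] stays in [0, 1] without renormalizing.
WEIGHTS = {
    "relevance": 0.35,
    "coverage": 0.25,
    "efficiency": 0.20,
    "historical_success": 0.10,
    "complementarity": 0.10,
}

def toolkit_reward(dimension_scores):
    """Weighted sum across the five dimensions; this is the reward
    signal backpropagated through the MCTS tree."""
    return sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)

reward = toolkit_reward({
    "relevance": 0.9, "coverage": 0.8, "efficiency": 1.0,
    "historical_success": 0.6, "complementarity": 0.7,
})  # 0.845
```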
MCTS for Agent Routing
Tool selection was the first application, but the same algorithm solves an even harder problem: multi-agent delegation.
When a complex task arrives — "Research the best authentication patterns for Next.js and implement OAuth2" — it shouldn't go to a single agent. It should flow through a chain of specialists: a research agent gathers patterns, a coding agent implements the solution, a review agent validates it.
Most systems route to a single agent using keyword matching or, at best, a scoring function. We use MCTS to explore delegation chains.
The Agent Tree
[empty]
/ | \
[research] [code] [planning]
/ \ |
[research, [research, [code,
code] review] review]
|
[research, code,
review]
Each path represents a different agent delegation sequence. The MCTS evaluator scores chains on:
| Signal | Weight | What it measures |
|---|---|---|
| Skill match | 30% | Do the agents' declared skills cover the task requirements? |
| Effort fit | 20% | Is the chain's capacity appropriate for the task complexity? |
| Success history | 20% | How have these agents performed historically? |
| Estimated latency | 15% | What's the expected end-to-end wall-clock time? Chain latency is additive — each agent adds its average response time. |
| Load awareness | 15% | Are these agents currently busy or idle? Prefer idle agents over saturated ones. |
Effort-Aware Chain Depth
Not every task needs a multi-agent pipeline. The router automatically adjusts chain depth based on task complexity:
| Effort Level | Chain Depth | Example |
|---|---|---|
| 1-2 (trivial) | 1 agent | "What port does Chronicle run on?" — Atlas answers from config in one hop |
| 3-5 (moderate) | 1-2 agents | "Summarize yesterday's Flux event bus errors" — Iris fetches logs, then distills |
| 6-7 (complex) | 2-3 agents | "Research OAuth2 patterns for Next.js and scaffold it into AitherVeil" — a research agent gathers patterns, a coding agent writes the implementation |
| 8-10 (expert) | 3-4 agents | "Audit our RBAC config, find privilege escalation paths, patch them, and validate the fix" — Vera audits, Atlas designs, a coding agent patches, Vera re-validates |
A simple question gets routed directly. A complex multi-step task gets a full delegation pipeline. The MCTS search naturally converges on the appropriate depth because shorter chains score higher on latency while longer chains score higher on coverage and skill match.
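The depth bands in the table reduce to a simple mapping. The exact thresholds below are illustrative, and each band's value here is the maximum depth for that band (the search may still converge on a shorter chain):

```python
def max_chain_depth(effort: int) -> int:
    """Map a 1-10 effort estimate to a maximum delegation chain depth,
    following the bands in the table (thresholds are illustrative)."""
    if effort <= 2:   # trivial: answer in one hop
        return 1
    if effort <= 5:   # moderate: fetch, then distill
        return 2
    if effort <= 7:   # complex: research, implement
        return 3
    return 4          # expert: audit, design, patch, re-validate
```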
The Learning Loop
Here's where it gets interesting. MCTS doesn't just select tools and forget about them. Every selection feeds into a consumption tracker.
When a tool is selected and used, we record it. When the task succeeds, we credit the tools that were part of the successful selection. When it fails, we note that too. Over time, this builds a historical success rate per tool and per agent — per task type.
This is the same concept as delayed credit assignment in reinforcement learning. You don't know if a tool selection was good until after the task completes. So you defer the credit and assign it retroactively.
The success rates feed back into the MCTS evaluator's "historical success" signal. Tools that reliably succeed in context get naturally preferred. Tools that reliably fail get deprioritized — but not eliminated, because the UCT exploration term ensures they still get occasional chances to redeem themselves.
This creates a flywheel:
Better selection → Higher success rates → Better historical signals → Even better selection
The system gets smarter with every task it handles, without any model fine-tuning, prompt engineering, or manual curation. The MCTS parameters are self-correcting.
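The delayed credit assignment described above can be sketched as a small tracker. The class and method names are hypothetical, and the smoothing prior (so unseen tools start at a neutral 0.5 rather than zero) is an assumption about one reasonable way to initialize the signal:

```python
from collections import defaultdict

class ConsumptionTracker:
    """Records which tools were part of each selection and assigns
    credit retroactively once the task outcome is known."""

    def __init__(self):
        self.stats = defaultdict(lambda: {"used": 0, "succeeded": 0})

    def record_selection(self, task_type, tools):
        # Defer credit: just remember what was selected for this task.
        return {"task_type": task_type, "tools": list(tools)}

    def record_outcome(self, selection, succeeded):
        # Delayed credit assignment: credit (or debit) every tool
        # that was part of the selection, per task type.
        for tool in selection["tools"]:
            key = (selection["task_type"], tool)
            self.stats[key]["used"] += 1
            if succeeded:
                self.stats[key]["succeeded"] += 1

    def success_rate(self, task_type, tool, prior=0.5):
        # Laplace-style smoothing: unseen tools start at the prior
        # instead of zero, so they aren't permanently buried.
        s = self.stats[(task_type, tool)]
        return (s["succeeded"] + prior) / (s["used"] + 1)
```

The `success_rate` output is what feeds the "historical success" dimension of the MCTS evaluator, closing the flywheel.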
Why Not Just Use a Classifier?
A reasonable objection: "Why not train a classifier to predict which tools are needed? Fine-tune a small model on (task, tools) pairs."
Three reasons:
1. Distribution shift. Tools change. New tools are added. Old tools are deprecated. A classifier trained on last month's tool registry is already stale. MCTS adapts immediately because it evaluates tools based on their current descriptions and capabilities, not a frozen training distribution.
2. Combinatorial reasoning. A classifier predicts individual tools. MCTS reasons about sets of tools. The value of including fetch_webpage depends on whether web_search is already in the set. MCTS captures these interactions naturally through its tree structure. A classifier would need exponentially many training examples to learn pairwise tool interactions.
3. No infrastructure. MCTS runs in pure compute. No training data pipeline. No GPU for inference. No model versioning. No retraining schedule. It's a search algorithm that runs in 300 milliseconds on a single CPU core.
Performance Characteristics
We designed the system with hard time constraints. Tool selection must complete in under 500ms. Agent routing must complete in under 1 second. These aren't aspirational targets — they're enforced time limits in the search loop.
In practice:
| Metric | Tool Selection | Agent Routing |
|---|---|---|
| Typical iterations | 100-150 | 150-200 |
| Wall-clock time | 200-400ms | 300-800ms |
| Candidate pool (pre-filter) | 15-25 tools | 8-12 agents |
| Output size | 3-8 tools | 1-4 agents |
| Confidence range | 0.6-0.95 | 0.5-0.9 |
The confidence score uses normalized entropy of the visit distribution. If one path dominates (80% of visits), confidence is high. If visits are spread evenly across many paths, confidence is low — the system is genuinely uncertain. Low confidence triggers fallback to simpler routing strategies rather than making an uncertain commitment.
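The normalized-entropy idea can be sketched in a few lines. This shows the mechanism only; the exact scaling that produces the 0.6-0.95 range above may differ in practice:

```python
import math

def confidence_from_visits(visits):
    """Confidence = 1 - normalized entropy of the root's visit
    distribution. One dominant path -> high confidence; visits
    spread evenly -> confidence near zero."""
    total = sum(visits)
    probs = [v / total for v in visits if v > 0]
    if len(probs) <= 1:
        return 1.0  # a single path is maximally certain
    entropy = -sum(p * math.log(p) for p in probs)
    max_entropy = math.log(len(probs))  # entropy of a uniform spread
    return 1.0 - entropy / max_entropy

dominant = confidence_from_visits([120, 10, 10, 10])  # one path dominates
uncertain = confidence_from_visits([38, 37, 38, 37])  # evenly spread
```

With the made-up visit counts above, `dominant` lands well above `uncertain`, which sits near zero; a production system would then threshold this value to decide when to fall back to simpler routing.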
The Pre-Filter Optimization
Running MCTS over fifty tools would waste iterations on obviously irrelevant candidates. Before the search begins, a fast pre-filter scores every tool by keyword overlap and historical success rate, keeping the top three times the target count. For a target of 8 tools, that's 24 candidates entering MCTS — small enough for efficient search but large enough for meaningful exploration.
Agents go through a similar pre-filter based on skill match, effort fit, success rate, and current load. This narrows the field from potentially dozens of agents to the top 10-12 before MCTS begins exploring chains.
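A sketch of the tool-side pre-filter. The 3× multiplier follows the worked example in the text (24 candidates for a target of 8); the 70/30 weighting between keyword overlap and historical success, and the tool-dict shape, are illustrative assumptions:

```python
def prefilter(task_keywords, tools, histories, target_size=8, multiplier=3):
    """Cheap pre-MCTS filter: rank every tool by keyword overlap with
    the task plus historical success, keep the top multiplier x target."""
    def score(tool):
        words = set(tool["description"].lower().split())
        overlap = len(task_keywords & words) / max(len(task_keywords), 1)
        history = histories.get(tool["name"], 0.5)  # neutral default
        return 0.7 * overlap + 0.3 * history        # illustrative weights
    ranked = sorted(tools, key=score, reverse=True)
    return ranked[: target_size * multiplier]
```

This is deliberately crude: its only job is to cut obviously irrelevant candidates so the MCTS iterations are spent on plausible ones.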
The Bigger Picture
Tool selection and agent routing are symptoms of a deeper problem in AI system design: we treat language models as universal decision engines when they are actually universal text completers.
An LLM can generate a plausible-looking tool call. But "plausible-looking" and "optimal" are not the same thing. The model has no concept of tool complementarity, no awareness of historical success rates, no ability to simulate multi-step outcomes, and no mechanism to learn from past decisions.
MCTS doesn't replace the LLM. It curates the decision space so the LLM only sees options that have survived a rigorous search process. The model still decides the final tool arguments and execution details. But it's choosing from a hand-picked lineup of 5 tools instead of fumbling through a phone book of 50.
This is the pattern we think will define the next generation of agent systems: search algorithms that reason about the decision space, feeding curated options to language models that execute within that space. The LLM is the executor. The search algorithm is the strategist.
AlphaGo didn't beat Lee Sedol by generating plausible-looking moves. It beat him by searching deeper than any human could, then using a neural network to evaluate what it found. The neural network was critical — but without the search, it was just another Go program.
Your AI agent is just another chatbot until you give it a search algorithm.
The road ahead has clear next moves: LLM-in-the-loop evaluation for ambiguous tasks, cross-session transfer so the system gets smarter for everyone, hierarchical search to scale beyond 200 tools, and dynamic re-routing that adapts mid-execution when an agent stumbles. These are active research directions. The foundation is running in production today.
The Bottom Line
If your AI system has more than ten tools and you're sending all of them to the LLM on every request, you're leaving performance on the table. The model is doing work that a search algorithm does better, faster, and with the ability to learn from outcomes.
Monte Carlo Tree Search is not new. Monte Carlo methods have been tackling decision-making under uncertainty in games since the early 1990s, and MCTS itself was formalized in 2006. What's new is applying it to the tool selection and agent routing problems that every production AI system faces.
The implementation is straightforward. No training data. No GPU requirements. No external dependencies. Just a tree, a scoring function, and the UCT formula that's been winning games for nearly two decades.
Your move.