AitherForge: Visual Workflow Builder with MCTS Branching and Training-First Design
Every AI orchestration tool gives you a way to chain LLM calls together. Few of them let you see what's happening inside those chains. Fewer still learn from their own executions. AitherForge does both.
The Problem: Invisible Pipelines
AitherOS already had substantial orchestration infrastructure — AgentForge for identity-aware dispatch, SwarmCodingEngine for 12-agent parallel coding, IntentChainRunner for effort-based routing, PipelineRunner for Unix-pipe data composition. But all of it was YAML-and-code-only. You couldn't visualize a workflow, replay a failed stage, or compare what would happen if you swapped the model on step 3.
Meanwhile, our Veil dashboard had a beautiful workflow builder UI sitting there calling Genesis endpoints that didn't exist yet. Five React components, custom canvas with drag-and-drop, seven step types — all dressed up with nowhere to go.
Today we wired it all together and added the features that make it genuinely useful.
What AitherForge Does
11 Step Types, One Engine
The WorkflowEngine handles eleven distinct step types, each delegating to the appropriate existing subsystem:
- Agent Call — AgentForge dispatch with identity resolution and effort scaling
- HTTP Request — Direct service calls with variable interpolation
- Transform — PipeTransforms (filter, sort, map, jq — the 10 Unix-pipe transforms we already had)
- Pipeline — Nested PipelineRunner execution
- Condition — Expression evaluation with branching
- Parallel — asyncio.gather with concurrency limits
- Loop — Collection iteration with per-item variable binding
- Delay — Timed pauses (capped at 5 minutes)
- Human Gate — Suspends execution, waits for human input via asyncio.Event
- MCTS Branch — Monte Carlo Tree Search across agent/model/prompt combinations
- Script — AitherZero PowerShell automation scripts with path-traversal prevention
Steps are connected as a DAG. The engine performs topological sort before execution, so you can wire steps in any order on the canvas and it figures out the right execution sequence.
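The ordering step can be sketched with Python's standard-library `graphlib`. This is only an illustration, not the WorkflowEngine's actual code: the `steps` mapping (step id to the ids it depends on) and both function names are assumptions made for the example.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(steps: dict[str, list[str]]) -> list[str]:
    """Return step ids so that every step comes after its dependencies.

    `steps` maps a step id to the ids it depends on (hypothetical shape).
    Raises graphlib.CycleError if the wiring contains a cycle.
    """
    return list(TopologicalSorter(steps).static_order())


def is_valid_dag(steps: dict[str, list[str]]) -> bool:
    """Check the wiring for cycles without computing an order."""
    try:
        TopologicalSorter(steps).prepare()
    except CycleError:
        return False
    return True


# Steps listed in arbitrary "canvas" order.
workflow = {
    "publish": ["review"],
    "review": ["draft"],
    "draft": ["research"],
    "research": [],
}

order = execution_order(workflow)
print(order)  # -> ['research', 'draft', 'review', 'publish']
```

Rejecting cycles up front matters as much as the ordering itself: a canvas lets users draw a loop by accident, and it is cheaper to refuse the workflow than to discover the deadlock mid-run.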
Per-Stage X-Ray Debugging
Every stage execution produces a StageTrace — a complete X-ray of what happened:
- Input/output snapshots — what data went in, what came out
- Prompt sent — the exact text after variable interpolation
- Model and agent used — which identity was resolved, which model served it
- Token count and cost — per-stage billing transparency
- Latency — wall-clock time in milliseconds
- Error details — full stack trace if something broke
The Veil frontend renders this as a Gantt-style timeline (how long each stage took relative to the whole run) and a detailed X-ray panel (click any stage to inspect its internals).
This is the kind of observability that separates "it worked" from "I understand why it worked." When you're chaining three agents, a condition, and a parallel block, knowing that stage 4 burned 3,200 tokens in 2.4 seconds while stage 2 only needed 400 tokens in 0.3 seconds is the difference between optimizing your pipeline and guessing at it.
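As a rough sketch of what such a trace record might look like, the dataclass below mirrors the fields listed above. The field names and the `token_share` helper are assumptions for illustration; the real StageTrace schema may differ.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class StageTrace:
    # Field names are illustrative, not the actual StageTrace schema.
    stage_id: str
    input_snapshot: Any
    output_snapshot: Any
    prompt_sent: str          # exact text after variable interpolation
    agent: str
    model: str
    tokens: int
    cost_usd: float
    latency_ms: float
    error: Optional[str] = None  # stack trace text if the stage failed


def token_share(traces: list[StageTrace]) -> dict[str, float]:
    """Fraction of the run's total tokens spent in each stage."""
    total = sum(t.tokens for t in traces)
    if total == 0:
        return {}
    return {t.stage_id: t.tokens / total for t in traces}


# The two stages from the example above: 400 tokens in 0.3 s vs 3,200 in 2.4 s.
traces = [
    StageTrace("stage_2", {}, {}, "summarize", "demiurge", "qwen3:8b", 400, 0.0, 300.0),
    StageTrace("stage_4", {}, {}, "review", "hydra", "qwq:32b", 3200, 0.0, 2400.0),
]
shares = token_share(traces)
```

With a record like this per stage, the Gantt view only needs `latency_ms`, while cost questions ("where did the tokens go?") reduce to simple aggregations such as `token_share`.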
Human-in-the-Loop Gates
Sometimes you need a human to approve before the pipeline continues. The human_gate step type suspends the workflow using an asyncio.Event and emits a FluxEmitter event so the dashboard can show the pending approval. When someone resolves the gate (via the UI or MCP tool), execution resumes from exactly where it paused.
This is the same pattern CI/CD systems use for deployment approval gates, but applied to agent workflows. Want to review the research agent's output before the writing agent starts drafting? Drop a human gate between them.
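The suspend-and-resume mechanics can be sketched with a bare `asyncio.Event`. The `HumanGate` class, the `run_workflow` coroutine, and the decision strings are hypothetical names for this example; the FluxEmitter notification is omitted.

```python
import asyncio
from typing import Optional


class HumanGate:
    """Minimal gate: the workflow awaits it, a human (or tool) resolves it."""

    def __init__(self) -> None:
        self._event = asyncio.Event()
        self.decision: Optional[str] = None

    async def wait(self) -> Optional[str]:
        await self._event.wait()  # suspends here until resolve() is called
        return self.decision

    def resolve(self, decision: str) -> None:
        self.decision = decision
        self._event.set()  # wakes the suspended workflow


async def run_workflow(gate: HumanGate, log: list[str]) -> list[str]:
    log.append("research done")
    decision = await gate.wait()  # paused between the two agents
    log.append("writing started" if decision == "approve" else "stopped")
    return log


async def demo() -> list[str]:
    gate = HumanGate()
    task = asyncio.create_task(run_workflow(gate, []))
    await asyncio.sleep(0)       # let the workflow run up to the gate
    assert not task.done()       # it is suspended, not finished
    gate.resolve("approve")      # what the UI or an MCP tool would trigger
    return await task


result = asyncio.run(demo())
print(result)  # -> ['research done', 'writing started']
```

Because the coroutine is merely suspended, all of its local state survives the wait, which is why execution can resume exactly where it paused without re-running earlier stages.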
Stage Replay
Completed stages can be replayed with overrides. Want to see what happens if you use a different model on step 3? Call replay_stage(run_id, stage_id, {"model": "qwq:32b"}) and get a new trace to compare against the original.
This turns every workflow execution into an experiment. Run it once, identify the bottleneck stage, replay it with a faster model, compare the output quality. The system captures both traces, which become DPO training data (more on that below).
MCTS: Let the Tree Search Pick the Best Agent
This is the feature I'm most excited about. The mcts_branch step type runs Monte Carlo Tree Search to evaluate multiple agent/model/prompt combinations and pick the winner.
How It Works
You configure branches — each one specifies an agent, an optional model override, and a prompt variant:
    - type: mcts_branch
      config:
        candidates: 3
        mode: standard
        branches:
          - agent: demiurge
            model: qwen3:8b
            prompt_variant: concise
          - agent: demiurge
            model: qwq:32b
            prompt_variant: detailed
          - agent: hydra
            prompt_variant: review_first
The engine dispatches all branches concurrently via asyncio.gather, then scores each result on five dimensions:
| Dimension | Weight | What It Measures |
|---|---|---|
| Success Likelihood | 0.30 | Output exists, no errors, reasonable length |
| Quality | 0.25 | Completeness, structure (code blocks, headers), vocabulary diversity |
| Cost Efficiency | 0.20 | Tokens used relative to a 10K baseline |
| Latency | 0.15 | Wall-clock time normalized against 60s |
| Safety | 0.10 | No leaked emails, passwords, phone numbers |
Scoring is entirely heuristic, with no extra LLM calls. In deep mode, which runs multiple iterations, the UCB1 formula selects which branches to explore next. UCB1 is the classic exploration/exploitation rule behind UCT-style tree search; AlphaGo used a related variant of the same idea.
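To make the arithmetic concrete, here is a minimal sketch of a weighted composite score and the standard UCB1 rule. The weights come from the table above; the linear normalizations in `cost_score` and `latency_score` are assumptions about how "relative to a 10K baseline" and "normalized against 60s" might be computed, and every function name is illustrative.

```python
import math

# Weights from the scoring table.
WEIGHTS = {
    "success": 0.30,
    "quality": 0.25,
    "cost": 0.20,
    "latency": 0.15,
    "safety": 0.10,
}


def cost_score(tokens: int, baseline: int = 10_000) -> float:
    """Assumed normalization: 1.0 at zero tokens, 0.0 at or above the baseline."""
    return max(0.0, 1.0 - tokens / baseline)


def latency_score(seconds: float, baseline: float = 60.0) -> float:
    """Assumed normalization: 1.0 for instant results, 0.0 at or above 60 s."""
    return max(0.0, 1.0 - seconds / baseline)


def composite(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each expected in [0, 1]."""
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)


def ucb1(mean_reward: float, visits: int, total_visits: int,
         c: float = math.sqrt(2)) -> float:
    """Standard UCB1: mean reward plus an exploration bonus.

    Unvisited branches score +inf so each is tried at least once.
    """
    if visits == 0:
        return math.inf
    return mean_reward + c * math.sqrt(math.log(total_visits) / visits)


branch = {
    "success": 1.0,
    "quality": 0.8,
    "cost": cost_score(3200),
    "latency": latency_score(2.4),
    "safety": 1.0,
}
total = composite(branch)
```

The exploration bonus shrinks as a branch accumulates visits, so a branch with a mediocre early score still gets retried before the search commits to a winner.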
The full exploration tree is returned as a glass-box output: you can see every branch's score breakdown, the selected winner, and all the alternatives. The frontend visualizes this directly.
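A simplified version of the dispatch-then-select flow might look like the sketch below. `run_branch` is a stand-in for a real agent call, `mock_score` replaces the heuristic scorer, and the shape of the returned tree is an assumption, not the engine's actual output format.

```python
import asyncio


async def run_branch(branch: dict) -> dict:
    """Stand-in for dispatching one branch to an agent (hypothetical)."""
    await asyncio.sleep(0)
    if branch["agent"] == "broken":
        raise RuntimeError("agent failed")
    return {"branch": branch, "score": branch["mock_score"]}


async def explore(branches: list[dict], max_concurrency: int = 2) -> dict:
    sem = asyncio.Semaphore(max_concurrency)  # concurrency limit

    async def guarded(branch: dict) -> dict:
        async with sem:
            return await run_branch(branch)

    # return_exceptions=True keeps one failing branch from cancelling the rest.
    results = await asyncio.gather(
        *(guarded(b) for b in branches), return_exceptions=True
    )
    scored = [r for r in results if not isinstance(r, BaseException)]
    winner = max(scored, key=lambda r: r["score"], default=None)
    return {
        "winner": winner,
        "alternatives": [r for r in scored if r is not winner],
        "failed": len(results) - len(scored),
    }


branches = [
    {"agent": "demiurge", "model": "qwen3:8b", "mock_score": 0.71},
    {"agent": "demiurge", "model": "qwq:32b", "mock_score": 0.83},
    {"agent": "broken", "mock_score": 0.0},
]
tree = asyncio.run(explore(branches))
```

Returning the alternatives and the failure count alongside the winner is what makes the output "glass-box": the caller can audit why a branch lost instead of only seeing the selected result.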
The Real Question MCTS Answers
The surface-level value is "pick the best response." The deeper value is answering the question every agent orchestration system avoids: for this specific task, which combination of agent identity, model size, and prompt style produces the best result per dollar?
A 32B reasoning model might produce better output than an 8B model, but is it 4x better? For a summarization task, maybe not. For a complex code review, probably yes. MCTS finds the empirical answer rather than relying on vibes or defaults.
World Model Training
Every MCTS exploration generates training data in the form (state, branches, scores, winner). This is exactly the format needed to train a world model — a small model that predicts "given this workflow state, which agent/model/prompt combination will produce the best outcome?"
Over time, the system is meant to learn which agents are best at which tasks, which models offer the best cost/quality tradeoff, and which prompt styles work for different problem types. Once the world model supplies priors to UCB1, MCTS should need fewer iterations to find a strong branch, because exploration starts from informed estimates instead of from scratch.
The loop: MCTS explores → training data captured → world model improves → MCTS explores more efficiently → better training data → better world model.
Training Data: Every Execution Teaches the System
AitherForge generates four types of training data from normal workflow execution. No special "training mode." No manual annotation. Just running workflows produces the signal.
1. MCTS Explorations → World Model
(state, branches, scores, winner) tuples from every MCTS branch evaluation. Trains a model to predict optimal agent/model selection without running the actual branches.
2. Human Gate Decisions → RLHF
(context, prompt, human_choice) from every human approval gate. The human's decision (approve, reject, modify) is a direct preference signal. If a human consistently rejects outputs from agent X on task type Y, that's actionable data.
3. Replay Comparisons → DPO
When someone replays a stage with different settings, we capture (input, output_A, output_B, preferred). This is Direct Preference Optimization data — the human explicitly chose to try a different approach, implying the replay was an improvement attempt.
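One plausible way to hold such a comparison is sketched below. The `ReplayComparison` class and its field names are assumptions; the `prompt`/`chosen`/`rejected` layout is the convention commonly used for DPO datasets, not necessarily what Strata ingests.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class ReplayComparison:
    # Mirrors the (input, output_A, output_B, preferred) tuple; names are illustrative.
    input: str
    output_a: str  # original stage output
    output_b: str  # replayed stage output
    preferred: Literal["a", "b"]

    def to_dpo_example(self) -> dict[str, str]:
        """Convert to a prompt/chosen/rejected record."""
        if self.preferred == "a":
            chosen, rejected = self.output_a, self.output_b
        else:
            chosen, rejected = self.output_b, self.output_a
        return {"prompt": self.input, "chosen": chosen, "rejected": rejected}


comparison = ReplayComparison(
    input="Summarize the incident report.",
    output_a="Short summary from the original run.",
    output_b="Clearer summary from the replay.",
    preferred="b",
)
example = comparison.to_dpo_example()
```

Keeping the raw pair and deriving the training record on demand means the preference can be corrected later without losing either output.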
4. Full Workflow Traces → Planning
Complete execution logs with every stage's inputs, outputs, timing, and costs. Trains a model to generate workflow definitions from task descriptions. "Given a task like 'research and write a blog post,' produce a 4-stage workflow with agent assignments."
All training data is shipped to Strata's HOT tier via POST /api/v1/ingest/workflow-trace and flows into the existing Harvest-to-NanoGPT training pipeline. The same infrastructure that trains on conversation data and daydream sessions now ingests workflow execution traces.
The Agent Builder
While building AitherForge, we also added an Agent Builder — an API and future visual editor for agent identity YAML files. You can:
- Browse all 16+ agent identities with their capabilities and effort caps
- Edit any identity's tools, skills, effort limits, and delegation permissions
- Create new agents with a structured API
- Test any agent by sending a test prompt through AgentForge and seeing the result
This pairs naturally with the workflow builder: design an agent's identity, then build a workflow that uses it, then let MCTS tell you whether your new agent is actually better than the default for the task you had in mind.
Architecture: Reuse Over Reinvention
AitherForge is a thin orchestration layer that delegates to existing subsystems rather than reimplementing them:
| Component | Delegates To | Already Existed |
|---|---|---|
| Agent | AgentForge (identity, effort, tenant) | Yes |
| Service | httpx + AitherPorts | Yes |
| Transform | PipeTransforms (10 transforms) | Yes |
| Pipeline | PipelineRunner (YAML pipelines) | Yes |
| Script | ScriptExecutor → PowerShell | New (thin wrapper) |
| MCTS | WorkflowMCTS (from NarrativeMCTS) | Generalized |
| Events | FluxEmitter (8 new EventTypes) | Extended |
| Persistence | WorkflowStore (SQLite) | New (follows PackageDB) |
| Training | Strata ingest | Extended |
The total implementation is roughly 4,800 lines across 11 new files and 8 modified files. Most of the "new" code is wiring — connecting existing capabilities through a unified execution interface and exposing them through the visual frontend.
The WorkflowEngine itself doesn't know how to call an LLM or run a pipeline. It knows how to sort a DAG, execute steps in order, record traces, handle human gates, and delegate everything else to the subsystem that already knows how to do it. This means every improvement to AgentForge, PipelineRunner, or the LLM stack automatically improves workflow execution.
MCP Tools: Agents Building Workflows
AitherForge ships with 11 MCP tools that let any agent (or external caller) manage workflows programmatically:
- create_workflow / update_workflow / delete_workflow — CRUD
- execute_workflow / get_workflow_run — execution and monitoring
- replay_workflow_stage — A/B testing stages
- resolve_workflow_gate — unblocking human gates
- estimate_workflow_cost — cost projection before running
- export_workflow_yaml — convert to pipeline format
Atlas (our project manager agent) now has workflow_management and workflow_composition skills. The next step is Atlas composing workflows from natural language: "Build me a pipeline that researches a topic, writes a draft, reviews it with Hydra, and publishes if approved."
What's Next
The foundation is laid. Coming up:
- React Flow upgrade — replace the custom canvas with @xyflow/react for proper minimap, auto-layout, and connection validation
- Live execution streaming — WebSocket updates as stages complete, real-time Gantt timeline
- World model integration — train NanoGPT on MCTS exploration data and use it for UCB1 priors
- Workflow marketplace — share workflow templates through APM
- Atlas autonomous composition — natural language to workflow definition
The goal has always been the same: an AI system that gets better at orchestrating AI by learning from its own executions. AitherForge is where that feedback loop becomes visible, debuggable, and — most importantly — self-improving.
Every workflow you run makes the next one smarter. That's not a slogan. It's the architecture.