Every AI orchestration tool gives you a way to chain LLM calls together. Few of them let you see what's happening inside those chains. Fewer still learn from their own executions. AitherForge does both.

The Problem: Invisible Pipelines

AitherOS already had substantial orchestration infrastructure — AgentForge for identity-aware dispatch, SwarmCodingEngine for 12-agent parallel coding, IntentChainRunner for effort-based routing, PipelineRunner for Unix-pipe data composition. But all of it was YAML-and-code-only. You couldn't visualize a workflow, replay a failed stage, or compare what would happen if you swapped the model on step 3.

Meanwhile, our Veil dashboard had a beautiful workflow builder UI sitting there calling Genesis endpoints that didn't exist yet. Five React components, custom canvas with drag-and-drop, seven step types — all dressed up with nowhere to go.

Today we wired it all together and added the features that make it genuinely useful.

What AitherForge Does

11 Step Types, One Engine

The WorkflowEngine handles eleven distinct step types, each delegating to the appropriate existing subsystem:

Agent Call — AgentForge dispatch with identity resolution and effort scaling
HTTP Request — Direct service calls with variable interpolation
Transform — PipeTransforms (filter, sort, map, jq — the 10 Unix-pipe transforms we already had)
Pipeline — Nested PipelineRunner execution
Condition — Expression evaluation with branching
Parallel — asyncio.gather with concurrency limits
Loop — Collection iteration with per-item variable binding
Delay — Timed pauses (capped at 5 minutes)
Human Gate — Suspends execution, waits for human input via asyncio.Event
MCTS Branch — Monte Carlo Tree Search across agent/model/prompt combinations
Script — AitherZero PowerShell automation scripts with path-traversal prevention
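
The Parallel step can be pictured with a short sketch. The post only says "asyncio.gather with concurrency limits"; using a semaphore to enforce the limit is an assumption here, and `gather_limited` is a hypothetical helper name rather than anything from the engine.

```python
import asyncio

async def gather_limited(coros, limit: int):
    """Run coroutines concurrently, at most `limit` at a time, keeping input order."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(coro):
        # Each coroutine only starts running once it holds a semaphore slot.
        async with semaphore:
            return await coro

    return await asyncio.gather(*(guarded(c) for c in coros))

async def demo():
    running = 0
    peak = 0

    async def work(n: int) -> int:
        # Hypothetical unit of work that tracks how many copies run at once.
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.01)
        running -= 1
        return n * 2

    results = await gather_limited([work(n) for n in range(6)], limit=2)
    return results, peak

results, peak = asyncio.run(demo())
```

`asyncio.gather` returns results in the order the coroutines were passed in, so the limit changes timing but not the shape of the output.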

Steps are connected as a DAG. The engine performs topological sort before execution, so you can wire steps in any order on the canvas and it figures out the right execution sequence.
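
A minimal sketch of the ordering step, assuming the workflow is stored as a mapping from each step id to the ids it depends on. The step names are hypothetical, and the standard-library `graphlib` module stands in for whatever sort the engine actually implements.

```python
from graphlib import TopologicalSorter, CycleError

def execution_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return step ids so that every step comes after all of its dependencies."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        # A cycle means the canvas wiring is not a DAG and cannot be executed.
        raise ValueError(f"workflow contains a cycle: {exc.args[1]}") from exc

# Hypothetical workflow, deliberately listed in the "wrong" order:
# only the dependencies matter, not the order steps were placed on the canvas.
steps = {
    "publish": {"review"},
    "review": {"draft"},
    "draft": {"research"},
    "research": set(),
}
order = execution_order(steps)
```

Rejecting cycles up front is the useful side effect: a workflow that cannot be sorted is reported before any stage runs.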

Per-Stage X-Ray Debugging

Every stage execution produces a StageTrace — a complete X-ray of what happened:

Input/output snapshots — what data went in, what came out
Prompt sent — the exact text after variable interpolation
Model and agent used — which identity was resolved, which model served it
Token count and cost — per-stage billing transparency
Latency — wall-clock time in milliseconds
Error details — full stack trace if something broke

The Veil frontend renders this as a Gantt-style timeline (how long each stage took relative to the whole run) and a detailed X-ray panel (click any stage to inspect its internals).

This is the kind of observability that separates "it worked" from "I understand why it worked." When you're chaining three agents, a condition, and a parallel block, knowing that stage 4 burned 3,200 tokens in 2.4 seconds while stage 2 only needed 400 tokens in 0.3 seconds is the difference between optimizing your pipeline and guessing at it.
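
To make the comparison concrete, here is a sketch of what a trace record and a simple hotspot report could look like. The field names are assumptions derived from the list above, not the real StageTrace schema, and cost is left at zero because the post gives no prices.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class StageTrace:
    """Hypothetical per-stage trace; field names are assumptions."""
    stage_id: str
    input_snapshot: Any
    output_snapshot: Any
    prompt: Optional[str]
    agent: Optional[str]
    model: Optional[str]
    tokens: int
    cost_usd: float
    latency_ms: float
    error: Optional[str] = None

def hotspots(traces: list[StageTrace]) -> list[tuple[str, float, float]]:
    """Return (stage_id, share of tokens, share of latency), heaviest tokens first."""
    total_tokens = sum(t.tokens for t in traces) or 1
    total_latency = sum(t.latency_ms for t in traces) or 1.0
    rows = [
        (t.stage_id, t.tokens / total_tokens, t.latency_ms / total_latency)
        for t in traces
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# The two stages from the example above: 400 tokens in 0.3 s vs 3,200 tokens in 2.4 s.
traces = [
    StageTrace("stage-2", {}, {}, "prompt", "demiurge", "qwen3:8b", 400, 0.0, 300.0),
    StageTrace("stage-4", {}, {}, "prompt", "demiurge", "qwq:32b", 3200, 0.0, 2400.0),
]
report = hotspots(traces)
```

The same per-stage shares are what a Gantt-style view needs: each bar's width is that stage's latency relative to the whole run.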

Human-in-the-Loop Gates

Sometimes you need a human to approve before the pipeline continues. The human_gate step type suspends the workflow using an asyncio.Event and emits a FluxEmitter event so the dashboard can show the pending approval. When someone resolves the gate (via the UI or MCP tool), execution resumes from exactly where it paused.

This is the same pattern CI/CD systems use for deployment approval gates, but applied to agent workflows. Want to review the research agent's output before the writing agent starts drafting? Drop a human gate between them.
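
The suspend-and-resume behavior can be sketched with a plain `asyncio.Event`. The `HumanGate` class and the decision strings are hypothetical, and the FluxEmitter notification is left out entirely.

```python
import asyncio

class HumanGate:
    """Hypothetical gate: wait() blocks until resolve() is called."""

    def __init__(self) -> None:
        self._event = asyncio.Event()
        self.decision = "pending"

    def resolve(self, decision: str) -> None:
        self.decision = decision
        self._event.set()

    async def wait(self) -> str:
        await self._event.wait()
        return self.decision

async def run_workflow(gate: HumanGate, log: list[str]) -> None:
    log.append("research done")
    decision = await gate.wait()  # the workflow is suspended here
    log.append("draft started" if decision == "approve" else "run stopped")

async def demo() -> list[str]:
    log: list[str] = []
    gate = HumanGate()
    task = asyncio.create_task(run_workflow(gate, log))
    await asyncio.sleep(0)        # let the workflow run until it reaches the gate
    log.append("gate pending")
    gate.resolve("approve")       # stands in for the UI or MCP tool call
    await task
    return log

log = asyncio.run(demo())
```

Because the coroutine is parked on the event rather than restarted, everything it had computed before the gate is still in scope when it resumes.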

Stage Replay

Completed stages can be replayed with overrides. Want to see what happens if you use a different model on step 3? Call replay_stage(run_id, stage_id, {"model": "qwq:32b"}) and get a new trace to compare against the original.

This turns every workflow execution into an experiment. Run it once, identify the bottleneck stage, replay it with a faster model, and compare the output quality. The system captures both traces, and the pair becomes DPO training data (more on that below).
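
A rough sketch of the replay idea, assuming a trace stores the stage's config and recorded input. The dictionary layout, `replay_with_overrides`, and `fake_execute` are all hypothetical; the real `replay_stage` takes a run id and stage id and looks the trace up itself.

```python
from typing import Any, Callable

Config = dict[str, Any]
Trace = dict[str, Any]

def replay_with_overrides(
    original: Trace,
    overrides: Config,
    execute: Callable[[Config, Any], Any],
) -> Trace:
    """Re-run one stage on its recorded input with a patched config."""
    config = {**original["config"], **overrides}  # override values win
    output = execute(config, original["input"])
    return {
        "stage_id": original["stage_id"],
        "config": config,
        "input": original["input"],
        "output": output,
        "replay_of": original.get("trace_id"),
    }

def fake_execute(config: Config, data: Any) -> str:
    # Stand-in for the real dispatch: labels the output with the model used.
    return f"{config['model']} summary of {data}"

original = {
    "trace_id": "t-1",
    "stage_id": "step-3",
    "config": {"agent": "demiurge", "model": "qwen3:8b"},
    "input": "notes",
    "output": "qwen3:8b summary of notes",
}
replayed = replay_with_overrides(original, {"model": "qwq:32b"}, fake_execute)
```

The original trace is never mutated, which is what lets the two traces sit side by side for comparison afterwards.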

MCTS: Let the Tree Search Pick the Best Agent

This is the feature I'm most excited about. The mcts_branch step type runs Monte Carlo Tree Search to evaluate multiple agent/model/prompt combinations and pick the winner.

How It Works

You configure branches — each one specifies an agent, an optional model override, and a prompt variant:

- type: mcts_branch
  config:
    candidates: 3
    mode: standard
    branches:
      - agent: demiurge
        model: qwen3:8b
        prompt_variant: concise
      - agent: demiurge
        model: qwq:32b
        prompt_variant: detailed
      - agent: hydra
        prompt_variant: review_first

The engine dispatches all branches concurrently via asyncio.gather, then scores each result on five dimensions:

Dimension	Weight	What It Measures
Success Likelihood	0.30	Output exists, no errors, reasonable length
Quality	0.25	Completeness, structure (code blocks, headers), vocabulary diversity
Cost Efficiency	0.20	Tokens used relative to a 10K baseline
Latency	0.15	Wall-clock time normalized against 60s
Safety	0.10	No leaked emails, passwords, phone numbers

Scoring is entirely heuristic, with no extra LLM calls. In deep mode, which runs multiple iterations, the UCB1 formula selects which branches to explore next. UCB1 is the classic exploration/exploitation rule behind UCT-style tree search; AlphaGo used a related variant (PUCT).
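
The weighted score and the UCB1 rule are small enough to sketch directly. The weights come from the table above; the linear shape of the cost and latency scores, the per-dimension input values, and the exploration constant are assumptions, not the engine's actual heuristics.

```python
import math

# Weights from the scoring table.
WEIGHTS = {
    "success": 0.30,
    "quality": 0.25,
    "cost": 0.20,
    "latency": 0.15,
    "safety": 0.10,
}

def cost_score(tokens: int, baseline: int = 10_000) -> float:
    """1.0 at zero tokens, falling linearly to 0.0 at the baseline (assumed shape)."""
    return max(0.0, 1.0 - tokens / baseline)

def latency_score(seconds: float, ceiling: float = 60.0) -> float:
    """1.0 for an instant reply, falling linearly to 0.0 at the ceiling (assumed shape)."""
    return max(0.0, 1.0 - seconds / ceiling)

def total_score(dims: dict[str, float]) -> float:
    """Weighted sum of the five dimension scores, each expected in [0, 1]."""
    return sum(WEIGHTS[name] * dims[name] for name in WEIGHTS)

def ucb1(mean_reward: float, visits: int, parent_visits: int,
         c: float = math.sqrt(2)) -> float:
    """Standard UCB1: average reward plus a bonus that shrinks with visits."""
    if visits == 0:
        return math.inf  # unvisited branches are always tried first
    return mean_reward + c * math.sqrt(math.log(parent_visits) / visits)

# Hypothetical branch: succeeded, decent quality, 3,200 tokens, 2.4 seconds, no leaks.
score = total_score({
    "success": 1.0,
    "quality": 0.8,
    "cost": cost_score(3200),
    "latency": latency_score(2.4),
    "safety": 1.0,
})
```

With these inputs the branch scores 0.30 + 0.20 + 0.136 + 0.144 + 0.10 = 0.88. In `ucb1`, a branch with few visits gets a larger bonus than a well-explored one with the same average, which is what keeps the search from locking onto an early winner.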

The full exploration tree is returned as a glass-box output: you can see every branch's score breakdown, the selected winner, and all the alternatives. The frontend visualizes this directly.

The Real Question MCTS Answers

The surface-level value is "pick the best response." The deeper value is answering the question every agent orchestration system avoids: for this specific task, which combination of agent identity, model size, and prompt style produces the best result per dollar?

A 32B reasoning model might produce better output than an 8B model, but is it 4x better? For a summarization task, maybe not. For a complex code review, probably yes. MCTS finds the empirical answer rather than relying on vibes or defaults.

World Model Training

Every MCTS exploration generates training data in the form (state, branches, scores, winner). This is exactly the format needed to train a world model — a small model that predicts "given this workflow state, which agent/model/prompt combination will produce the best outcome?"

Over time, the system learns which agents are best at which tasks, which models offer the best cost/quality tradeoff, and which prompt styles work for different problem types. The MCTS becomes faster because the world model provides better priors for the UCB1 exploration.

The loop: MCTS explores → training data captured → world model improves → MCTS explores more efficiently → better training data → better world model.
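
As a deliberately simple stand-in for the world model, the sketch below turns past exploration records into win-rate priors per task type. A trained model would replace this frequency table; the record fields (`task_type`, `winner`) and the example history are hypothetical.

```python
from collections import Counter, defaultdict

def win_rate_priors(explorations: list[dict]) -> dict[str, dict[str, float]]:
    """Map task type -> branch label -> fraction of explorations that branch won."""
    wins: dict[str, Counter] = defaultdict(Counter)
    for record in explorations:
        wins[record["task_type"]][record["winner"]] += 1

    priors: dict[str, dict[str, float]] = {}
    for task_type, counter in wins.items():
        total = sum(counter.values())
        priors[task_type] = {branch: n / total for branch, n in counter.items()}
    return priors

# Hypothetical history of three explorations for one task type.
history = [
    {"task_type": "summarize", "winner": "demiurge/qwen3:8b"},
    {"task_type": "summarize", "winner": "demiurge/qwen3:8b"},
    {"task_type": "summarize", "winner": "demiurge/qwq:32b"},
]
priors = win_rate_priors(history)
```

Even this crude table captures the shape of the loop: more explorations sharpen the priors, and sharper priors mean fewer branches need to be run to find the winner.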

Training Data: Every Execution Teaches the System

AitherForge generates four types of training data from normal workflow execution. No special "training mode." No manual annotation. Just running workflows produces the signal.

1. MCTS Explorations → World Model

(state, branches, scores, winner) tuples from every MCTS branch evaluation. Trains a model to predict optimal agent/model selection without running the actual branches.

2. Human Gate Decisions → RLHF

(context, prompt, human_choice) from every human approval gate. The human's decision (approve, reject, modify) is a direct preference signal. If a human consistently rejects outputs from agent X on task type Y, that's actionable data.

3. Replay Comparisons → DPO

When someone replays a stage with different settings, we capture (input, output_A, output_B, preferred). This is Direct Preference Optimization data — the human explicitly chose to try a different approach, implying the replay was an improvement attempt.

4. Full Workflow Traces → Planning

Complete execution logs with every stage's inputs, outputs, timing, and costs. Trains a model to generate workflow definitions from task descriptions. "Given a task like 'research and write a blog post,' produce a 4-stage workflow with agent assignments."

All training data is shipped to Strata's HOT tier via POST /api/v1/ingest/workflow-trace and flows into the existing Harvest-to-NanoGPT training pipeline. The same infrastructure that trains on conversation data and daydream sessions now ingests workflow execution traces.
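
A sketch of how one of these records might be packaged for ingest. Only the endpoint path comes from the post; the record fields, the `kind` tag, and the `strata.example` host are assumptions, and the request is built but not sent so the example runs offline.

```python
import json
import urllib.request

INGEST_PATH = "/api/v1/ingest/workflow-trace"  # path from the post

def dpo_record(input_data, output_a, output_b, preferred: str) -> dict:
    """Turn a replay comparison into a chosen/rejected pair (hypothetical field names)."""
    if preferred not in ("a", "b"):
        raise ValueError("preferred must be 'a' or 'b'")
    if preferred == "a":
        chosen, rejected = output_a, output_b
    else:
        chosen, rejected = output_b, output_a
    return {"kind": "dpo", "prompt": input_data, "chosen": chosen, "rejected": rejected}

def build_ingest_request(base_url: str, record: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST to the ingest endpoint."""
    body = json.dumps(record).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + INGEST_PATH,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

record = dpo_record("notes", "short summary", "detailed summary", preferred="b")
request = build_ingest_request("http://strata.example", record)
# urllib.request.urlopen(request) would send it; omitted here on purpose.
```

The other three record types would follow the same pattern, each with its own `kind` and tuple fields.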

The Agent Builder

While building AitherForge, we also added an Agent Builder — an API and future visual editor for agent identity YAML files. You can:

Browse all 16+ agent identities with their capabilities and effort caps
Edit any identity's tools, skills, effort limits, and delegation permissions
Create new agents with a structured API
Test any agent by sending a test prompt through AgentForge and seeing the result

This pairs naturally with the workflow builder: design an agent's identity, then build a workflow that uses it, then let MCTS tell you whether your new agent is actually better than the default for the task you had in mind.

Architecture: Reuse Over Reinvention

AitherForge is a thin orchestration layer that delegates to existing subsystems rather than reimplementing them:

Component	Delegates To	Already Existed?
Agent	AgentForge (identity, effort, tenant)	Yes
Service	httpx + AitherPorts	Yes
Transform	PipeTransforms (10 transforms)	Yes
Pipeline	PipelineRunner (YAML pipelines)	Yes
Script	ScriptExecutor → PowerShell	New (thin wrapper)
MCTS	WorkflowMCTS (from NarrativeMCTS)	Generalized
Events	FluxEmitter (8 new EventTypes)	Extended
Persistence	WorkflowStore (SQLite)	New (follows PackageDB)
Training	Strata ingest	Extended

The total implementation is roughly 4,800 lines across 11 new files and 8 modified files. Most of the "new" code is wiring — connecting existing capabilities through a unified execution interface and exposing them through the visual frontend.

The WorkflowEngine itself doesn't know how to call an LLM or run a pipeline. It knows how to sort a DAG, execute steps in order, record traces, handle human gates, and delegate everything else to the subsystem that already knows how to do it. This means every improvement to AgentForge, PipelineRunner, or the LLM stack automatically improves workflow execution.
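
The delegation idea can be sketched as a handler registry: the loop knows nothing about what a step does, only which handler to call. The registry, decorator, and step dictionaries below are hypothetical, and the two handlers are toy stand-ins for PipeTransforms and the Delay step.

```python
import asyncio
from typing import Any, Awaitable, Callable

Handler = Callable[[dict, Any], Awaitable[Any]]
HANDLERS: dict[str, Handler] = {}

def handles(step_type: str):
    """Register a coroutine as the handler for one step type."""
    def register(fn: Handler) -> Handler:
        HANDLERS[step_type] = fn
        return fn
    return register

@handles("transform")
async def run_transform(config: dict, data: Any) -> Any:
    # Stand-in for PipeTransforms: only "sort" is sketched.
    return sorted(data) if config["op"] == "sort" else data

@handles("delay")
async def run_delay(config: dict, data: Any) -> Any:
    await asyncio.sleep(min(config["seconds"], 300))  # 5 minute cap, as in the step list
    return data

async def run_steps(steps: list[dict], data: Any):
    """Execute steps in the given (already sorted) order and record a trace per step."""
    traces = []
    for step in steps:
        handler = HANDLERS[step["type"]]
        data = await handler(step["config"], data)
        traces.append({"step": step["id"], "output": data})
    return data, traces

workflow = [
    {"id": "s1", "type": "delay", "config": {"seconds": 0}},
    {"id": "s2", "type": "transform", "config": {"op": "sort"}},
]
result, traces = asyncio.run(run_steps(workflow, [3, 1, 2]))
```

Adding a step type in this shape means registering one more handler; the execution loop and the trace recording stay untouched.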

MCP Tools: Agents Building Workflows

AitherForge ships with 11 MCP tools that let any agent (or external caller) manage workflows programmatically, including:

create_workflow / update_workflow / delete_workflow — CRUD
execute_workflow / get_workflow_run — execution and monitoring
replay_workflow_stage — A/B testing stages
resolve_workflow_gate — unblocking human gates
estimate_workflow_cost — cost projection before running
export_workflow_yaml — convert to pipeline format

Atlas (our project manager agent) now has workflow_management and workflow_composition skills. The next step is Atlas composing workflows from natural language: "Build me a pipeline that researches a topic, writes a draft, reviews it with Hydra, and publishes if approved."

What's Next

The foundation is laid. Coming up:

React Flow upgrade — replace the custom canvas with @xyflow/react for proper minimap, auto-layout, and connection validation
Live execution streaming — WebSocket updates as stages complete, real-time Gantt timeline
World model integration — train NanoGPT on MCTS exploration data and use it for UCB1 priors
Workflow marketplace — share workflow templates through APM
Atlas autonomous composition — natural language to workflow definition

The goal has always been the same: an AI system that gets better at orchestrating AI by learning from its own executions. AitherForge is where that feedback loop becomes visible, debuggable, and — most importantly — self-improving.

Every workflow you run makes the next one smarter. That's not a slogan. It's the architecture.


Tags: engineering, workflows, mcts, training, agents, architecture

AitherForge: Visual Workflow Builder with MCTS Branching and Training-First Design

March 15, 2026 · 14 min read · Aitherium