
AI That Runs What It Writes: Wiring Live Docker Containers Into the Agent Pipeline

March 12, 2026 · 11 min read · Aitherium

There is a specific moment that exposes every AI coding assistant as a parlor trick: the moment after code generation, when the user copies the output into a terminal and discovers it does not work.

The model was confident. The syntax looked right. The explanation was thorough. But the import was wrong, the API changed six months ago, the function signature has a subtle incompatibility, and now you are debugging code that an AI assured you was correct. The model never ran it. It does not know how. It generated text that looks like code and handed it to you with a smile.

We decided to fix this. Not with better prompting, not with chain-of-thought, but with actual infrastructure: Docker containers that spin up automatically when agents generate code, live previews that render in the IDE before the agent finishes its response, and tool-level access so agents can iterate on running containers in real time.

This is the story of how we wired AitherSandbox into the entire stack.

The Problem Is Architectural, Not Intellectual

The reason most AI agents cannot run their own code is not that the models are incapable of understanding execution. It is that the system around them was never designed for it.

A typical AI coding workflow looks like this:

1. User asks for a web app
2. Model generates code
3. Model returns code as text in a chat message
4. User manually creates files, installs dependencies, runs the code
5. Something breaks
6. User pastes the error back into the chat
7. Goto 2

This is a communication protocol, not an engineering system. The model has no hands. It cannot create files, start processes, observe output, or iterate. It is a brain in a jar, producing text that describes what it wishes it could do.

The fix is not to make the brain smarter. The fix is to give it hands.

AitherSandbox: The Existing Hands

AitherSandbox has existed in our stack for months — a Layer 7 service at port 8131 that provides two capabilities:

Code execution: Run Python in an isolated subprocess with AST validation, restricted builtins, import filtering, resource limits, and security scanning. The Genesis-absorbed router adds RBAC, ProcessPoolExecutor isolation, and a custom __import__ hook that blocks dangerous modules.
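
A minimal sketch of the static half of import filtering, using Python's standard ast module. The blocklist and the function name are hypothetical, chosen for illustration; the actual AitherSandbox validator is not shown in this post and presumably checks far more than imports.

```python
import ast

# Hypothetical blocklist for illustration only; the real list is not shown in this post.
BLOCKED_MODULES = {"os", "subprocess", "socket", "ctypes"}


def find_blocked_imports(source: str) -> list[str]:
    """Return the sorted top-level blocked modules that `source` imports."""
    tree = ast.parse(source)  # raises SyntaxError if the code does not parse
    found: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for relative imports such as "from . import x"
            names = [node.module or ""]
        else:
            continue
        for name in names:
            root = name.split(".")[0]  # "os.path" counts as "os"
            if root in BLOCKED_MODULES:
                found.add(root)
    return sorted(found)
```

A static pass like this only sees import statements. A call such as __import__("os") is an ordinary function call in the AST, which is one reason to pair the static check with a runtime __import__ hook like the one described above.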

Web preview containers: Spin up ephemeral Docker containers with dev servers, auto-detect the runtime (Node, Python, or static), mount source files, map a host port from the 9100-9199 range, and return a preview URL. Containers run with --network none, are memory-capped at 512MB and CPU-limited to 1.0, and are auto-cleaned after TTL expiry.
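
Runtime auto-detection and port allocation can be sketched roughly as follows. The detection heuristic (package.json means Node, a requirements.txt or any .py file means Python) and both function names are assumptions made for illustration; only the three runtime types and the 9100-9199 range come from the description above.

```python
# Preview port range from the description above: 9100-9199 inclusive.
PORT_RANGE = range(9100, 9200)


def detect_runtime(files: dict[str, str]) -> str:
    """Guess the container type from filenames (hypothetical heuristic)."""
    names = set(files)
    if "package.json" in names:
        return "node"
    if "requirements.txt" in names or any(n.endswith(".py") for n in names):
        return "python"
    return "static"


def allocate_port(in_use: set[int]) -> int:
    """Return the lowest port in the preview range that is not already taken."""
    for port in PORT_RANGE:
        if port not in in_use:
            return port
    raise RuntimeError("no free preview port in 9100-9199")
```

With a 100-port range, the allocator also acts as a natural cap on concurrent previews: the 101st request fails loudly instead of colliding with a running container.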

The infrastructure was solid. The problem was that nothing used it properly.

Demiurge had a CodeExecutor class that could run Python snippets through the sandbox. The standalone sandbox page in the Veil dashboard could display containers. But these were disconnected islands. Forge agents — the identity-aware subagents dispatched by AgentForge — had no sandbox tools at all. The SwarmCodingEngine's 11-agent pipeline tested code via bare subprocess calls instead of Docker isolation. And the Forge IDE, our AI-native development environment, had zero awareness of running containers.

The agent could generate code but could not see it run. The IDE could edit files but could not display what those files produced. The sandbox existed but nobody had finished the wiring.

Four Layers of Integration

Layer 1: Agent Tools

The first layer was giving agents the ability to create and manage preview containers as tool calls during their ReAct loops.

We added register_sandbox_preview_tools() to AgentRuntime with four tools:

create_sandbox_preview — Takes a dict of filename→content, container type (auto/node/python/static), optional entry and install commands, port, and TTL. Returns a session ID and preview URL.
update_sandbox_files — Hot-reloads files in a running container. The dev server detects changes automatically.
list_sandbox_previews — Returns all active containers with status, ports, and remaining TTL.
stop_sandbox_preview — Cleans up a container by session ID.

These tools are registered inside register_sandbox_tools(), which already provided execute_code. Now a single registration gives agents both code execution and container management.

The key design decision: these tools go through the SandboxClient (our canonical async client for port 8131), not through direct HTTP calls. This means they participate in the same health checking, connection pooling, and error handling as every other service client in the stack.
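
To make the shape of such a tool concrete, here is a hedged sketch of a create_sandbox_preview tool body that delegates to a client object. FakeSandboxClient, PreviewSession, the parameter names, and the 600-second default TTL are all hypothetical stand-ins; the real SandboxClient signature is not shown in this post.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class PreviewSession:
    """Hypothetical return type for a created preview."""
    session_id: str
    preview_url: str


class FakeSandboxClient:
    """Stand-in for SandboxClient so the sketch runs without a live service."""

    async def create_preview(self, files, container_type="auto", ttl=600):
        # A real client would make a pooled, health-checked request to port 8131.
        return PreviewSession("demo-session", "http://localhost:9100")


async def create_sandbox_preview(client, files: dict[str, str],
                                 container_type: str = "auto",
                                 ttl: int = 600) -> dict:
    """Tool body: validate input, delegate to the client, return a plain dict."""
    if not files:
        return {"ok": False, "error": "no files provided"}
    session = await client.create_preview(
        files, container_type=container_type, ttl=ttl
    )
    return {
        "ok": True,
        "session_id": session.session_id,
        "preview_url": session.preview_url,
    }


result = asyncio.run(
    create_sandbox_preview(FakeSandboxClient(), {"index.html": "<h1>hi</h1>"})
)
```

Because the tool only talks to the client interface, swapping the fake for the real client changes nothing in the tool body, and the tool inherits whatever pooling and error handling the client provides.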

Layer 2: AgentForge Tool Registry

AgentForge's _build_tool_registry() assembles tools based on each agent's tool_profile — a YAML list in the identity config that controls what capabilities each agent gets. We added sandbox as a new tool category:

if allow_all or "sandbox" in profile or "code_forge" in profile:
    register_sandbox_tools(registry)

The or "code_forge" is deliberate: any agent that can generate code should be able to run it. This covers Demiurge, Aither, and Hydra out of the box. We also updated their identity YAMLs to include sandbox explicitly, because tool profiles should be self-documenting.
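
The effect of that condition is easiest to see in a self-contained version. In this sketch the registry is a plain dict and register_sandbox_tools is a stub that only records tool names; the "web_search" profile entry is a made-up example, and the real registry types are not shown in this post.

```python
SANDBOX_TOOL_NAMES = (
    "execute_code",
    "create_sandbox_preview",
    "update_sandbox_files",
    "list_sandbox_previews",
    "stop_sandbox_preview",
)


def register_sandbox_tools(registry: dict) -> None:
    """Stub registration: records tool names instead of real callables."""
    for name in SANDBOX_TOOL_NAMES:
        registry[name] = f"<tool {name}>"


def build_tool_registry(profile: list[str], allow_all: bool = False) -> dict:
    """Assemble a registry from a tool profile (sandbox category only)."""
    registry: dict = {}
    if allow_all or "sandbox" in profile or "code_forge" in profile:
        register_sandbox_tools(registry)
    return registry


coder = build_tool_registry(["code_forge"])        # gets sandbox tools implicitly
researcher = build_tool_registry(["web_search"])   # hypothetical profile, no sandbox
```

An agent whose profile lists only code_forge still ends up with all five sandbox tools, which is the behavior the condition is meant to guarantee.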

Now when Demiurge is forged as a subagent for a coding task, it automatically gets create_sandbox_preview in its tool registry. During its ReAct loop, it can generate a React app, call the tool to spin up a container, observe the result, and iterate — all within a single agent turn.

Layer 3: Delivery Pipeline

The SwarmCodingEngine runs an 11-agent pipeline: ARCHITECT → SWARM → REVIEW → JUDGE. After the JUDGE phase, the SwarmDeliveryPipeline packages the artifacts, runs syntax checks and pytest, and mails a report.

We added create_preview_for_bundle() to the delivery pipeline. After sandbox_test_bundle() runs the local syntax and test checks, we fire off a background task that:

1. Scans all code files in the bundle for web-previewable content (HTML, JSX, TSX, CSS, Vue, Svelte)
2. If web files exist, calls SandboxClient.create_preview() with the files
3. Logs the preview URL
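
The scanning step might look something like the sketch below, assuming the bundle is a dict of path to content and that "web-previewable" is decided by file extension. The function name and the extension mapping are illustrative guesses based on the file types listed above, not the pipeline's actual code.

```python
from pathlib import PurePosixPath

# Extensions inferred from the file types listed above (assumed mapping).
WEB_EXTENSIONS = {".html", ".jsx", ".tsx", ".css", ".vue", ".svelte"}


def select_web_files(bundle: dict[str, str]) -> dict[str, str]:
    """Return only the bundle entries whose extension marks them as web content."""
    return {
        path: content
        for path, content in bundle.items()
        if PurePosixPath(path).suffix.lower() in WEB_EXTENSIONS
    }


bundle = {"src/App.tsx": "export default () => null", "main.py": "", "README.md": ""}
web_files = select_web_files(bundle)
```

An empty result is the signal to skip the preview entirely, so bundles that contain only backend code never start a container.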

This means when the swarm finishes building a web app, a Docker container is already starting before the delivery email goes out. By the time the user reads the report, they can click into the Forge IDE and see the running result.

Layer 4: The Forge IDE

This is where the user actually sees the containers.

ActivityBar: We added a sandbox item with a Container icon to the VS Code-style activity bar on the left. Click it and the left panel shows the SandboxPreviewPanel.

SandboxPreviewPanel: A sidebar component that polls /api/sandbox/previews every 3 seconds. It displays all active containers with:

Container type icons (⬡ Node, 🐍 Python, 📄 Static)
Status dots (green = running, red = error, gray = stopped)
Port numbers and remaining TTL countdowns
Hover actions: open preview, open in new tab, stop container

Click any running container and the main editor area switches to SandboxPreviewView.

SandboxPreviewView: A full-width iframe viewer with a URL bar showing localhost:{port}, container type badge, TTL countdown, and reload/stop/external-link/close controls. The iframe uses sandbox="allow-scripts allow-forms allow-same-origin allow-popups" for security isolation.

The transition is instant: you are editing code in Monaco, Demiurge spins up a container in the background, a green dot appears in the sidebar, you click it, and you are looking at the running app. Click any editor tab to go back to code. The preview session persists until you stop it or the TTL expires.

The Proxy Fix Nobody Noticed

While wiring this up, we discovered a routing bug in the Veil API proxy. The catch-all route at /api/sandbox/[...path] was constructing target URLs as http://localhost:8131/{path}, but AitherSandbox's endpoints are all prefixed with /sandbox/. So /api/sandbox/previews was hitting http://localhost:8131/previews — a 404.

The standalone sandbox page had been working around this by using different endpoint names, which is why nobody noticed. We fixed the proxy to prepend /sandbox/ and corrected the standalone page's session listing endpoint from the nonexistent /preview/list to the actual /previews.
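
The corrected URL construction amounts to a single prefix. The real proxy is a TypeScript route handler; this Python sketch only illustrates the path logic, and the function name is hypothetical.

```python
SANDBOX_BASE = "http://localhost:8131"


def build_target_url(path_segments: list[str]) -> str:
    """Join catch-all path segments and prepend the /sandbox/ prefix."""
    path = "/".join(segment.strip("/") for segment in path_segments if segment)
    return f"{SANDBOX_BASE}/sandbox/{path}"


# /api/sandbox/previews now resolves to the prefixed endpoint.
url = build_target_url(["previews"])
```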

Small bugs in proxy layers are the kind of thing that survives for months because "it works when I test it manually."

What This Changes

Before this work, the agent workflow for code generation was:

1. Agent generates code as text
2. User reads it in a chat bubble
3. User manually creates files and runs them
4. User reports back

After this work:

1. Agent generates code
2. Agent calls create_sandbox_preview with the files
3. A Docker container starts automatically
4. The Forge IDE shows a green dot in the sidebar
5. User clicks it, sees the running app in an iframe
6. Agent can call update_sandbox_files to iterate
7. User watches changes hot-reload in real time

The agent is no longer a brain in a jar. It has hands, it can see what those hands produce, and the user can watch the entire process unfold in a single IDE.

The Broader Pattern

This is part of a pattern we keep encountering: the bottleneck in AI systems is rarely the model. It is the infrastructure between the model and the world.

Models can reason about code. They understand program structure, API design, error handling, and testing strategy. What they lack is agency — the ability to act on their reasoning. Every capability you give them (file I/O, web search, tool use, container management) removes one more excuse for the system to shrug and say "I generated some text, good luck."

The path from "AI that writes code" to "AI that ships code" is not about better prompts or larger context windows. It is about plumbing: giving the agent access to the same infrastructure that a human developer uses. Docker containers, file systems, git operations, CI pipelines, live previews.

We will keep building the plumbing.

Technical Reference

New files:

components/forge/sandbox-panel.tsx — Sidebar container list (307 lines)
components/forge/sandbox-preview-view.tsx — Main-area iframe viewer (139 lines)

Modified files (sandbox wiring):

lib/core/AgentRuntime.py — register_sandbox_preview_tools() (4 tools, +190 lines)
lib/orchestration/AgentForge.py — sandbox tool category (+10 lines)
lib/orchestration/SwarmDeliveryPipeline.py — create_preview_for_bundle() (+65 lines)
lib/orchestration/SwarmCodingEngine.py — auto-preview in delivery (+3 lines)
services/agents/AitherDemiurge.py — CodeExecutor.create_preview() (+66 lines)
config/identities/{demiurge,aither,hydra}.yaml — sandbox added to tool profiles
components/forge/layout/ActivityBar.tsx — sandbox activity item
app/demi-2/page.tsx — sandbox panel + preview wired into ForgeLayout
app/api/sandbox/[...path]/route.ts — fixed proxy path prefix
app/sandbox/page.tsx — fixed endpoint path

Total: ~6,500 lines across 41 files.
