We run 202 microservices across ~65 Docker containers. Our website, demo.aitherium.com, points at the local Docker stack via a Cloudflare tunnel. Every docker compose up --build on a single service immediately affects the live platform. Every bug in a new image takes down the site until someone notices and manually fixes it.

This is fine when you are the only user. It stops being fine the moment anyone else depends on your platform being up.

The Problem: Zero Isolation

The failure mode was always the same:

1. Change code in a service
2. Rebuild the container
3. New container has a bug
4. Live site goes down
5. Frantically roll back manually
6. Hope nobody noticed

There was no staging environment. No automated rollback. No way for our AI agents to promote or demote deployments. Everything was :latest all the way down.

Why Not Just Use Kubernetes?

The obvious answer is "put it on k8s with rolling deploys." But we are running on a single machine with one GPU. Kubernetes is designed for clusters. Running k3s would eat RAM we need for the actual services, add operational complexity, and solve a problem we do not actually have (multi-node orchestration). We needed isolation, not orchestration.

The Architecture: Shared Backbone + Ring Overlays

The key insight: on a single machine, you cannot run 3x the full stack. But you do not need to. Most of the heavy services — vLLM (GPU), Redis, PostgreSQL, MinIO — are stateful singletons that both dev and prod can share. The application layer (Genesis, Veil, Node, Secrets) is stateless and cheap to duplicate.

                    +------------------+
                    |  Shared Backbone |
                    |  vLLM (GPU)      |
                    |  Redis           |
                    |  PostgreSQL      |
                    |  MinIO           |
                    |  Docker Proxy    |
                    +--------+---------+
                             |
              aither-shared-net (Docker bridge)
                    /                  \
        +----------+--------+  +-------+-----------+
        |   Dev Ring Overlay |  |  Prod Ring Overlay |
        |   Genesis :8001    |  |  Genesis :9001     |
        |   Veil    :3000    |  |  Veil    :3001     |
        |   Node    :8080    |  |  Node    :9080     |
        |   + 8 more svcs    |  |  + 8 more svcs     |
        +---aither-dev-net---+  +---aither-prod-net--+

Each ring overlay runs as a separate Docker Compose project with its own network. Ring containers join both their ring network (for isolation) and the shared network (for backbone access). Dev and prod cannot see each other, but both can reach vLLM and Redis.

The cost? Each ring overlay uses ~6-12 GB of RAM depending on the profile. Two overlays fit comfortably on 64 GB alongside the shared backbone.

Staging Is a Gate, Not an Environment

We initially planned three concurrent environments. Then we did the math: a third overlay would push RAM usage uncomfortably high, and staging does not need to run — it needs to validate.

Staging is a gate between dev and prod. When you promote dev to staging, the system runs health checks and tests against the dev images. If they pass, the images are tagged as staging candidates and can be promoted to prod. No third set of containers needed.
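As a rough illustration of "validate, then tag", the gate can be sketched as a function that runs checks against the running dev ring and only then retags the images. This is a minimal sketch, not the actual RingManager code: the function name, the `staging-<commit>` tag format, and the injected `run` callable are all assumptions made for the example.

```python
from typing import Callable, Iterable, List


def promote_to_staging(
    services: Iterable[str],
    checks: Iterable[Callable[[], bool]],
    run: Callable[[List[str]], object],
    commit: str,
) -> List[str]:
    """Validate the dev ring, then tag its images as staging candidates.

    No containers are started: staging exists only as image tags.
    """
    # Run every check against the already-running dev ring; stop at the first failure.
    for check in checks:
        if not check():
            name = getattr(check, "__name__", repr(check))
            raise RuntimeError(f"staging gate failed: {name}")

    tagged = []
    for svc in services:
        source = f"{svc}:dev-latest"
        target = f"{svc}:staging-{commit}"  # hypothetical tag format
        run(["docker", "tag", source, target])  # `docker tag SOURCE TARGET`
        tagged.append(target)
    return tagged
```

In real use, `run` could be `subprocess.check_call`; injecting it keeps the sketch testable without a Docker daemon.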

Network Isolation via Docker Compose Projects

Docker Compose projects give us free namespacing:

# Shared backbone (always running)
docker compose -f docker-compose.ring-shared.yml up -d

# Dev ring (separate project = separate namespace)
docker compose -f docker-compose.ring-dev.yml --env-file .env.dev up -d

# Prod ring (separate project = separate namespace)
docker compose -f docker-compose.ring-prod.yml --env-file .env.prod up -d

Network aliases on the shared network let ring containers use original hostnames (aitheros-genesis resolves within each ring's context). Zero code changes. Services still call AitherPorts.get_service_url("Genesis") and get the right URL because AITHER_SERVICE_HOST_PREFIX is set per ring.
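To make the per-ring prefix idea concrete, here is a minimal sketch of how a lookup like `get_service_url` could build a hostname from `AITHER_SERVICE_HOST_PREFIX`. The real AitherPorts implementation is not shown in this post, so the port table, the default prefix, and the assumption that container ports are identical in both rings are all illustrative.

```python
import os
from typing import Mapping

# Hypothetical container-port table; the real AitherPorts registry is not shown here.
_PORTS = {"Genesis": 8001, "Veil": 3000, "Node": 8080}


def get_service_url(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Build a service URL whose hostname depends on the ring's prefix."""
    # Each ring sets its own prefix, e.g. "aitheros-dev-" or "aitheros-prod-".
    # The fallback "aitheros-" is an assumption that mimics the original hostnames.
    prefix = env.get("AITHER_SERVICE_HOST_PREFIX", "aitheros-")
    host = f"{prefix}{name.lower()}"
    return f"http://{host}:{_PORTS[name]}"
```

With the prefix set to `aitheros-prod-`, a call for `"Veil"` yields `http://aitheros-prod-veil:3000`, the same target the tunnel config points at; calling code never has to know which ring it runs in.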

Image Versioning Without a Registry

We do not run a local Docker registry. Instead, we use a tag convention:

aitheros-genesis:dev-latest      # Current dev ring
aitheros-genesis:dev-abc1234     # Dev at specific commit
aitheros-genesis:prod-latest     # Current prod ring
aitheros-genesis:prod-def5678    # Prod at specific commit
aitheros-genesis:rollback-prod   # Last known good prod

State is tracked in a simple JSON file (rings-state.json) with the current tag, commit SHA, deployment timestamp, and rollback target for each ring. Every promotion/rollback is appended to a JSONL history file and ingested into Strata for telemetry.
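A minimal sketch of that bookkeeping, assuming a state file keyed by ring name and one JSON object per history line. The field names (`tag`, `commit`, `deployed_at`, `rollback_tag`) and the function name are guesses for illustration, not the actual schema.

```python
import json
import time
from pathlib import Path


def record_deployment(state_path, history_path, ring, tag, commit,
                      rollback_tag, now=None):
    """Update the per-ring state file and append one line to the history log."""
    state_path, history_path = Path(state_path), Path(history_path)
    state = json.loads(state_path.read_text()) if state_path.exists() else {}

    entry = {
        "tag": tag,
        "commit": commit,
        "deployed_at": time.time() if now is None else now,
        "rollback_tag": rollback_tag,
    }
    state[ring] = entry

    # Write to a temp file and rename it, so a crash cannot leave half-written state.
    tmp = state_path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state, indent=2))
    tmp.replace(state_path)

    # JSONL: one self-contained JSON object per line, append-only.
    with history_path.open("a") as fh:
        fh.write(json.dumps({"ring": ring, "event": "promote", **entry}) + "\n")
    return entry
```

Keeping the current state small and overwritten while the history is append-only means a rollback only reads one key, and the full audit trail can be ingested line by line.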

Three Layers of Rollback

Layer 1: Autoheal (seconds)

The existing autoheal container restarts crashed containers automatically. Handles transient failures like OOM kills or startup race conditions.

Layer 2: RingMonitor Auto-Rollback (90 seconds)

A new RingMonitor sidecar polls critical services (/health) every 15 seconds. If more than 50% of critical services (Genesis, Veil, Node, Secrets) are unhealthy for more than 90 seconds after a deployment, it triggers automatic rollback:

1. Tags the current (broken) state for forensics
2. Reads the rollback tag from state
3. Tears down the ring (docker compose down)
4. Brings it back up with the last-known-good images
5. Verifies health and emits a Flux event

This means: deploy a broken image to prod, and the system self-heals within two minutes without human intervention.
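The trigger condition (more than half of the critical services unhealthy for more than 90 seconds) can be sketched as a small state machine fed by each poll. This is an illustrative sketch rather than the RingMonitor source: the class name is made up, it ignores the "after a deployment" window, and it treats a service missing from the poll result as unhealthy.

```python
from typing import Dict, Optional

CRITICAL = ("Genesis", "Veil", "Node", "Secrets")
UNHEALTHY_FRACTION = 0.5  # roll back when MORE than this fraction is unhealthy
GRACE_SECONDS = 90        # ...and has stayed that way for longer than this


class RollbackDecider:
    """Tracks how long the ring has been continuously unhealthy."""

    def __init__(self) -> None:
        self._unhealthy_since: Optional[float] = None

    def observe(self, health: Dict[str, bool], now: float) -> bool:
        """Feed one poll result; return True when a rollback should fire."""
        unhealthy = sum(1 for svc in CRITICAL if not health.get(svc, False))
        if unhealthy / len(CRITICAL) <= UNHEALTHY_FRACTION:
            # Ring recovered (or was never bad): reset the timer.
            self._unhealthy_since = None
            return False
        if self._unhealthy_since is None:
            self._unhealthy_since = now
        return now - self._unhealthy_since > GRACE_SECONDS
```

With a 15-second poll interval, the first poll that can exceed the 90-second window arrives at 105 seconds, which is consistent with recovery taking closer to two minutes than to 90 seconds once teardown and restart are included.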

Layer 3: Manual/Agent Rollback (on demand)

For when you need surgical control:

# CLI
python ring-ctl.py rollback prod
python ring-ctl.py rollback prod --tag prod-abc1234

# MCP tool (from Claude Code or any agent)
ring_rollback(ring="prod")

# Genesis API
curl -X POST localhost:9001/rings/prod/rollback

Promotion Flow

Promotion runs gates defined in rings.yaml before moving images between rings:

Dev ──[health + tests + lint]──> Staging (validation gate)
                                      │
                                      ├─[smoke tests]──> Prod
                                      │
                                      └─ If prod fails health → auto-rollback

The gates are pluggable. Currently:

health_check: All dev services must be healthy
tests_pass: pytest dev/tests/ --timeout=60 -x must pass
lint_clean: ruff check . (optional, not blocking)
docker_build: Images must build (already satisfied if dev is running)

For staging-to-prod, we add smoke tests and (optionally) manual approval. The --force flag bypasses gates for emergencies.
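One way to picture "pluggable" gates with blocking and non-blocking variants is a list of named checks. The sketch below is an assumption about the shape, not the contents of rings.yaml or RingManager: the `Gate` dataclass and `run_gates` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Gate:
    name: str
    check: Callable[[], bool]
    blocking: bool = True  # non-blocking gates are reported but never stop a promotion


def run_gates(gates: List[Gate], force: bool = False) -> Tuple[bool, List[str]]:
    """Return (promotion allowed, names of the gates that failed)."""
    if force:
        # Emergency path: skip every gate, mirroring the --force flag.
        return True, []
    failed = [g for g in gates if not g.check()]
    allowed = not any(g.blocking for g in failed)
    return allowed, [g.name for g in failed]
```

A failing `lint_clean` gate registered with `blocking=False` shows up in the failed list but leaves `allowed` true, while a failing `tests_pass` gate stops the promotion.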

Agent-Driven Deployments

The entire pipeline is exposed as MCP tools, so our AI agents can operate it:

# Atlas checks ring health during routine patrols
ring_health(ring="prod")

# Demiurge promotes after a successful code change
ring_promote(source="dev", target="staging")
ring_promote(source="staging", target="prod")

# Any agent can check deployment history
ring_history(ring="prod", limit=5)

# Emergency rollback
ring_rollback(ring="prod")

Genesis exposes REST endpoints (/rings/*) that the MCP tools proxy to. Everything flows through the same RingManager core, whether triggered by CLI, API, or agent.
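To show what "proxy" means here, the following sketch builds the same request as the curl example earlier (`POST /rings/prod/rollback`) using only the standard library. The function name and base URL constant are illustrative, and sending an optional tag as a JSON body is an assumption; the post does not say how the API accepts a specific tag.

```python
import json
import urllib.request
from typing import Optional

GENESIS_BASE = "http://localhost:9001"  # prod Genesis, as in the curl example


def build_rollback_request(
    ring: str, tag: Optional[str] = None, base: str = GENESIS_BASE
) -> urllib.request.Request:
    """Build the HTTP request an MCP tool could forward to Genesis."""
    # Assumption: a specific tag, if given, travels as a small JSON body.
    body = json.dumps({"tag": tag}).encode() if tag else None
    headers = {"Content-Type": "application/json"} if body else {}
    return urllib.request.Request(
        f"{base}/rings/{ring}/rollback",
        data=body,
        headers=headers,
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it. Keeping the tool this thin is what lets the CLI, the API, and agents share one code path in the core.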

Tunnel Routing

The Cloudflare tunnel config maps hostnames to ring containers:

# Dev ring
- hostname: dev.aitherium.com
  service: http://aitheros-dev-veil:3000

# Prod ring
- hostname: demo.aitherium.com
  service: http://aitheros-prod-veil:3000
- hostname: aitherium.com
  service: http://aitheros-prod-veil:3000

dev.aitherium.com shows the dev ring. demo.aitherium.com shows prod. Both served from the same machine, completely isolated at the Docker network layer.

What We Learned

Compose projects are underrated. Most Docker Compose tutorials show one project. Running multiple projects with shared external networks gives you 80% of what namespaces provide in Kubernetes, with zero added infrastructure.

Staging does not need to be running. For a single-machine deployment, staging as a validation gate is dramatically simpler than staging as a third environment. Run the checks, tag the images, move on.

Auto-rollback changes your relationship with deployment. When you know the system will self-heal within about two minutes (90 seconds of failed health checks, then the rollback itself), you deploy more often and with less anxiety. The safety net makes you bolder, not lazier.

Agents should operate infrastructure. Exposing ring operations as MCP tools means our agents can respond to health alerts by rolling back, or promote code after verifying test results. The human is the escalation path, not the deployment operator.

Try It

If you are running AitherOS:

# Start shared backbone + dev ring
docker compose -f docker-compose.ring-shared.yml up -d
docker compose -f docker-compose.ring-dev.yml --env-file .env.dev up -d

# Make changes, test, then promote to prod
python AitherOS/scripts/ring-ctl.py promote dev staging
python AitherOS/scripts/ring-ctl.py promote staging prod   # add --force only to bypass gates in an emergency

# Check status
python AitherOS/scripts/ring-ctl.py status

The full implementation is in the repository: compose files, RingManager, RingMonitor, Genesis endpoints, and MCP tools. Total new code: ~1,500 lines. Total infrastructure added: zero. Just Docker Compose doing what it was designed to do.


Ring Deployment: How We Run Dev and Prod on One Machine Without Losing Sleep

May 19, 2026 · 11 min read · Aitherium