Zero-Downtime Deploys on a System That Never Stops Changing
AitherOS is a 210-service AI operating system. 119 containers running simultaneously on a single workstation with an RTX 5090. Chat, agent orchestration, model inference, code generation, image generation, voice synthesis, knowledge graphs, identity management, secret vaults, event buses — the works.
The dashboard — AitherVeil — is the operator's single pane of glass into all of it. It's a Next.js app. It gets rebuilt constantly. Sometimes multiple times in an hour. It sits behind a Cloudflare Tunnel at demo.aitherium.com and real people are using it.
This is the story of the day everything went wrong, and why nobody noticed.
The Architecture
AitherVeil's production topology is three Docker containers:
demo.aitherium.com
│
Cloudflare Tunnel
│
▼
┌──────────────────┐
│ aither-veil-lb │ nginx:alpine (port 3080)
└──┬───────────┬───┘
│ │
▼ ▼
┌────────┐ ┌─────────────┐
│ Veil │ │ Veil │
│ primary│ │ standby │
│ :3000 │ │ :3000 │
└────────┘ └─────────────┘
Primary handles all traffic. Standby is a hot spare — identical image, same network, same TLS trust chain, same Ed25519 signing identity. The LB is a 120-line nginx config that makes failover automatic and invisible.
The key trick is DNS-based failover. Here's the real config from veil-lb.conf:
resolver 127.0.0.11 valid=10s ipv6=off;
location / {
set $veil_primary http://aitheros-veil:3000;
proxy_pass $veil_primary;
error_page 502 503 504 = @standby;
}
location @standby {
set $veil_standby http://aitheros-veil-standby:3000;
proxy_pass $veil_standby;
}
Three things make this work:
- `resolver 127.0.0.11 valid=10s ipv6=off` — Docker's internal DNS. Container IPs change after every `docker compose up -d`. Nginx re-resolves every 10 seconds. No manual intervention.
- `set $var` before `proxy_pass` — This forces per-request DNS resolution instead of caching the IP at config load. Without the variable, nginx resolves once at startup and sends traffic to a dead IP after a recreate.
- `error_page 502 503 504 = @standby` — If primary returns a connection error, nginx silently retries the identical request against standby. The user's browser never sees an error page.
The LB itself never restarts during a deploy. It uses depends_on: condition: service_started (not service_healthy). It comes up even if Veil is still booting. It's the anchor point that survives everything.
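A minimal compose sketch of this topology might look like the following. Service names are the ones used in this article; the image tags, port mapping, and config path are illustrative assumptions:

```yaml
# Sketch only — file paths and tags are placeholders, not the production compose file.
services:
  aither-veil-lb:
    image: nginx:alpine
    ports:
      - "3080:80"
    volumes:
      - ./veil-lb.conf:/etc/nginx/conf.d/default.conf:ro
    depends_on:
      aither-veil:
        # service_started, NOT service_healthy: the LB must come up
        # even while Veil is still booting.
        condition: service_started

  aither-veil:
    image: aitheros-veil:latest

  aither-veil-standby:
    image: aitheros-veil:latest
```

Because the LB depends only on the containers *starting*, recreating either Veil container never forces an LB restart.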
The Day Everything Went Wrong
April 10th, 2026. A user reports: "Chat says Checking..."
That "Checking..." means the frontend is polling /api/genesis/health and getting no response. Could be Genesis down. Could be the network. Could be anything.
I opened the terminal and started pulling threads. What I found was a cascading failure that had been building for hours, silently, with no alerts.
Layer 1: GPU Containers Crash-Looping
Two vLLM containers — aither-vllm (a fallback profile) and aither-vllm-coding — were crash-looping. The fallback had 101 restarts. Every time it restarted, it tried to allocate GPU memory, failed because the primary vLLM instances already held the VRAM, and crashed again. Over and over.
The coding instance was trying to download deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct on every restart. A 16GB model download attempt, 101 times.
Neither container was supposed to be running. They were leftovers from a previous configuration. Docker's restart: unless-stopped policy kept bringing them back from the dead.
Lesson: restart: unless-stopped is a loaded gun. A crashed container that shouldn't exist will consume resources forever if you don't explicitly remove it.
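One way to defuse that gun — a sketch of how the compose file *could* be restructured, not what AitherOS actually does — is to make optional GPU services opt-in via Compose profiles and cap their restarts:

```yaml
# Sketch: an opt-in GPU service that stays dead when it crashes.
services:
  aither-vllm-fallback:
    profiles: ["gpu-fallback"]   # never starts unless --profile gpu-fallback is passed
    restart: on-failure:3        # give up after 3 crashes instead of looping forever
```

A plain `docker compose up -d` skips profiled services entirely, so a leftover config can't resurrect itself, and `on-failure:3` bounds the damage if it does run.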
Layer 2: The Deleted Base Image
I tried to rebuild Pulse (the event bus and alerting service). It failed:
ERROR: failed to solve: aitheros-base:latest: not found
An earlier cleanup operation — an AI coding agent running docker system prune to free disk space — had nuked aitheros-base:latest. This is the unified base image that 90% of our Python services inherit from. Without it, nothing can rebuild.
No base image → can't rebuild Pulse → can't start the alerting system → no alerts about anything.
Lesson: Your base Docker image is infrastructure. AI agents with terminal access will happily destroy it while "cleaning up." Pin a digest. Push it to a registry. Tag it so prune won't touch it. And never give an automated tool unconstrained access to docker system prune.
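Pinning by digest can be sketched in a child service's Dockerfile. The registry and digest below are placeholders — resolve the real digest with `docker images --digests` or `docker buildx imagetools inspect`:

```dockerfile
# Sketch: pin the base by digest so a prune + re-pull can't silently change it.
# <digest-of-pinned-build> is a placeholder, not a real digest.
FROM registry.example.com/aitheros-base@sha256:<digest-of-pinned-build>
COPY . /app
```

A digest reference survives tag deletion on the local machine as long as the image lives in a registry; `docker system prune` can only cost you a re-pull, never the image itself.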
Layer 3: Alerting Can't Alert
AitherOS has a ContainerDeathWatchdog that monitors critical containers and fires email alerts when they die. It runs inside Pulse.
Pulse was down because the base image was deleted.
The system that watches for failures had itself failed, and there was nobody watching the watcher. The alerting system was the single point of failure for alerting.
Lesson: Never run your death monitor inside something that can die. Or if you do, have a second, dumber watchdog — a cron job, a systemd timer, anything — that checks whether the first watchdog is alive.
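The logic of that second, dumber watchdog is small enough to sketch. The class name is made up here, and the probe and alert transports (curl, Pushover, email) are deliberately left out:

```typescript
// Minimal outer-watchdog logic: alert exactly once after N consecutive
// failed probes, then reset when the service recovers.
class DeadMansSwitch {
  private failures = 0;

  constructor(private readonly threshold: number = 3) {}

  // Record one probe result. Returns true only at the moment the
  // failure streak first reaches the threshold — i.e. when to alert.
  record(healthy: boolean): boolean {
    this.failures = healthy ? 0 : this.failures + 1;
    return this.failures === this.threshold;
  }
}
```

A cron job or systemd timer would call `record()` with the result of an HTTP probe each minute; firing only when the streak *first* hits the threshold means one alert per outage instead of one per tick.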
Layer 4: 150 Bind Mounts and the 9P Protocol
Genesis — the kernel orchestrator, the single most important service in the stack — was accepting TCP connections but not responding to HTTP. The event loop was completely saturated.
The root cause: Docker Desktop's 9P file sharing protocol.
Every service in the compose file had two bind mounts:
volumes:
- ./AitherOS/lib:/app/lib:ro
- ./AitherOS/services:/app/services:ro
These were development conveniences — overlay the COPY'd code with the host filesystem so you can edit without rebuilding. Across all services, that's 150 bind mount lines.
On Docker Desktop for Windows, bind mounts use the 9P protocol. 9P is a network filesystem. Every file access — every import, every open(), every stat() — crosses a network boundary between the Linux VM and the Windows host. Under load, this adds milliseconds of latency to operations that should take microseconds.
Genesis runs a single uvicorn worker. Its event loop handles chat requests, agent dispatch, health checks, routine execution, and background tasks. When every import and file read is slow, the event loop can't keep up. It accepts connections (TCP) but can't process them (HTTP). From the outside, it looks like the service is up but dead.
Lesson: 9P bind mounts are poison for async Python services. COPY your code into the image. Use bind mounts only for data volumes. If you need live reload, use Docker Compose's watch feature, which rebuilds the image instead of overlaying the filesystem.
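If live reload is still wanted, Compose's `develop`/`watch` feature can replace the overlay. A sketch, with paths assumed to match the build context above:

```yaml
# Sketch: develop/watch instead of 9P bind mounts. Paths are assumptions.
services:
  genesis:
    build: ./AitherOS
    develop:
      watch:
        # On change, copy the files into the running container and restart it —
        # no host filesystem overlay, no 9P round-trip on every read.
        - action: sync+restart
          path: ./AitherOS/lib
          target: /app/lib
```

Run it with `docker compose watch`. The container still reads code from its own image layer at native speed; the host filesystem is only touched when a file actually changes.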
Layer 5: The Protocol Mismatch
After fixing all of the above — stopping the crash-loopers, rebuilding the base image, rebuilding Pulse, ripping out 150 bind mount lines, recreating Genesis — the chat still showed "Checking..."
Genesis was healthy. I could curl it directly and get 200 OK. But Veil's API routes were all timing out.
Veil's Node.js process was locked up. The logs were nothing but:
[ServiceSigner] AitherSecrets unavailable — retrying in 60s: TypeError: fetch failed
[ServiceSigner] AitherSecrets unavailable — retrying in 60s: TypeError: fetch failed
[ServiceSigner] AitherSecrets unavailable — retrying in 60s: TypeError: fetch failed
The ServiceSigner provisions Ed25519 signing keys from AitherSecrets. It was configured to connect via http://aitheros-secrets:8111. Every other service in the stack uses HTTPS with an internal CA. Secrets was rejecting the plaintext HTTP connection, the fetch was throwing TypeError: fetch failed (Node's opaque catch-all for any network-level failure — here, plaintext HTTP hitting a TLS-only port), and the retry loop was flooding the event loop.
But it wasn't just the signer. A codebase-wide grep found 37 more hardcoded http:// URLs across Veil — in API routes, middleware, audit logging, blog broadcasting, billing, every service proxy. All wrong. All should have been https://.
The fix touched 44 files. Every single hardcoded inter-service URL.
Lesson: If your services use TLS internally, enforce it everywhere. One http:// in a fetch() call can lock up an entire Node.js process when the server expects HTTPS. And use a centralized URL resolver (getServiceUrl()) instead of hardcoding URLs in route handlers.
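A resolver like that can be tiny. This sketch assumes the Secrets port from above; the Genesis port and the exact shape of the real `getServiceUrl()` are illustrative, not AitherVeil's actual code:

```typescript
// Hypothetical centralized resolver in the spirit of getServiceUrl().
// Only Secrets' port 8111 comes from the article; the rest is a placeholder.
const SERVICE_PORTS: Record<string, number> = {
  "aitheros-secrets": 8111,
  "aitheros-genesis": 8000, // assumed port, for illustration
};

function getServiceUrl(service: string, path = ""): string {
  const port = SERVICE_PORTS[service];
  if (port === undefined) {
    throw new Error(`Unknown service: ${service}`);
  }
  // Scheme is hardcoded to https:// so no route handler can reintroduce
  // the plaintext-vs-TLS mismatch by hand-building a URL.
  return `https://${service}:${port}${path}`;
}
```

With every route handler forced through one function, "grep for `http://`" becomes a lint rule instead of a 44-file incident response.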
How The Architecture Saved Us
Here's the thing: through all of this — the crash-looping GPUs, the deleted base image, the 150 bind mounts, the protocol mismatch — the load balancer never went down.
aitheros-veil-lb was up for 9+ hours straight while everything behind it burned. It's a 4MB nginx:alpine container with no dependencies on any backend service. It resolves DNS, proxies requests, and fails over. That's it.
When we finally had the fix ready, the deploy looked like this:
1. `docker compose build aither-veil` — 13 seconds (BuildKit cache hit on deps layer)
2. `docker compose up -d --no-deps aither-veil` — recreate primary
3. Standby absorbs traffic for ~12 seconds while primary boots
4. `docker compose up -d --no-deps aither-veil-standby` — recreate standby
5. Both containers healthy. Zero dropped requests.
$ curl http://localhost:3080/api/genesis/health
{"status":"healthy","service":"AitherGenesis","uptime_sec":70.85,"sync_status":"live"}
The user who reported "Checking..." refreshed their browser and it worked. No maintenance window. No "we're deploying, please wait." No incident page.
The Real Nginx Config
The production veil-lb.conf handles more than just HTTP failover:
# WebSocket connections (1-hour keepalive for chat)
location /ws/ {
set $veil_primary http://aitheros-veil:3000;
proxy_pass $veil_primary;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_read_timeout 3600s; # 1 hour
error_page 502 503 504 = @standby;
}
# Health endpoint with dedicated standby fallback
location /api/health {
set $veil_health http://aitheros-veil:3000;
proxy_pass $veil_health;
proxy_connect_timeout 2s;
error_page 502 503 504 = @health_standby;
}
# HLS video segments — immutable, year-long cache
location ~ ^/api/stream/hls/.*\.ts$ {
set $veil_hls http://aitheros-veil:3000;
proxy_pass $veil_hls;
add_header Cache-Control "public, max-age=31536000, immutable";
}
# Live HLS from MediaMTX — no cache
location /hls/live/ {
set $mediamtx http://aitheros-mediamtx:8889;
proxy_pass $mediamtx/;
add_header Cache-Control "no-cache";
}
Three things to note:
- WebSockets get their own location block with a 1-hour read timeout. Chat sessions stay alive through normal traffic spikes. Only a full container restart drops them.
- Health endpoint has its own failover chain. External monitors (Cloudflare, UptimeRobot) only report "down" when both Veil containers are dead.
- HLS segments are immutable. Once a `.ts` chunk is written, it never changes, so we cache it for a year. Playlists (`.m3u8`) are no-cache because they update as transcoding progresses.
Resource Budget
Running two copies of a Next.js app isn't free:
| Container | CPU Limit | RAM Limit | CPU Reserve | RAM Reserve |
|---|---|---|---|---|
| Primary | 2 cores | 6 GB | 0.5 cores | 1.5 GB |
| Standby | 2 cores | 6 GB | 0.5 cores | 1.5 GB |
| LB | default | default | default | default |
The standby's 0.5 CPU / 1.5 GB reservation is the insurance premium. On a workstation running an RTX 5090 with 128 GB of system RAM, 1.5 GB for a hot spare that prevents downtime is nothing.
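In compose terms, one row of the table above corresponds to a `deploy.resources` block like this (primary shown; standby uses the same values):

```yaml
# Sketch matching the budget table — docker compose applies these
# limits/reservations even outside Swarm mode.
services:
  aither-veil:
    deploy:
      resources:
        limits:
          cpus: "2"
          memory: 6G
        reservations:
          cpus: "0.5"
          memory: 1.5G
```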
Both containers share:
- The same Docker image (`aitheros-veil:latest`)
- The same signing key volume (`data/signing`)
- The same blog content volume (`content/blog`)
- The same internal CA chain (`NODE_EXTRA_CA_CERTS`)
Health Check Cascade
Each layer has tiered health checks — the standby checks faster because its job is to be ready:
| Container | Interval | Timeout | Start Period | Retries |
|---|---|---|---|---|
| LB | 10s | 5s | — | 3 |
| Primary | 60s | 15s | 90s | 5 |
| Standby | 30s | 10s | 60s | 3 |
The primary gets a 90-second start period because Next.js can take a while to compile routes on first boot. The standby gets 60 seconds. The LB gets none — it's nginx, it's ready in milliseconds.
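The standby's row from the table translates to a healthcheck block like this — the probe command itself is an assumption, only the timings come from the table:

```yaml
# Sketch: standby's healthcheck. Probe command assumed; intervals from the table.
services:
  aither-veil-standby:
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:3000/api/health"]
      interval: 30s
      timeout: 10s
      start_period: 60s
      retries: 3
```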
What We Changed After This Incident
- Ripped out all 150 bind mounts. Every `./AitherOS/lib:/app/lib:ro` and `./AitherOS/services:/app/services:ro` line — gone. Services use COPY'd code from their Docker images. The compose file is 150 lines lighter and every service starts faster.
- Rebuilt the base image and documented it. `aitheros-base:latest` now gets rebuilt in CI. It won't disappear after a prune again.
- Moved the ContainerDeathWatchdog. It still runs in Pulse, but Pulse itself is now a monitored container. Docker's own healthcheck + `restart: unless-stopped` serves as the outer watchdog. If Pulse dies and can't restart, Docker Desktop's notification system catches it.
- Fixed every hardcoded URL in Veil. 44 files, `http://` → `https://`, with a codebase-wide grep to make sure we didn't miss any. The `getServiceUrl()` function already did the right thing — it was the 37 route handlers that bypassed it.
- Added `proxy_intercept_errors on` to the main location block in the LB config. Without it, nginx returns the upstream's error body instead of triggering the `@standby` fallback.
What We'd Still Improve
- Blue/green deploys with commit-SHA tags. Right now both containers use `:latest`. We'd rather update standby first, verify it, then promote. The 15-second window of stale code is acceptable but not ideal.
- Graceful drain. We don't wait for in-flight SSE streams to finish before killing the old container. Docker's 10-second SIGTERM grace period handles most cases, but long agent chat sessions get cut.
- External dead-man's switch. A 5-line cron job on the host that curls the health endpoint and sends a Pushover notification if it fails 3 times in a row. Independent of the entire Docker stack.
The Takeaway
The interesting part of this story isn't the nginx config. You can copy that in five minutes. The interesting part is that a 210-service system experienced a five-layer cascading failure — crash-looping GPUs, a deleted base image, a dead alerting system, filesystem protocol poisoning, and a TLS mismatch — and the user saw "Checking..." for a few hours instead of a crater.
That's because the load balancer is the dumbest, most reliable thing in the stack. It has no opinions. It doesn't import anything. It doesn't connect to a database. It resolves DNS, proxies bytes, and fails over on 502. When everything behind it was on fire, it kept answering the door and politely explaining that nobody was home.
Build the dumbest possible thing at the edge. Make it the last thing that dies. Ship everything else behind it as fast as you want.
Your users won't know the difference, and that's the whole point.