Zero-Downtime Deploys on a System That Never Stops Changing
AitherOS is a 210-service AI operating system. 119 containers running simultaneously on a single workstation with an RTX 5090. Chat, agent orchestration, model inference, code generation, image generation, voice synthesis, knowledge graphs, identity management, secret vaults, event buses — the works.
The dashboard — AitherVeil — is the operator's single pane of glass into all of it. It's a Next.js app. It gets rebuilt constantly. Sometimes multiple times in an hour. It sits behind a Cloudflare Tunnel at demo.aitherium.com and real people are using it.
This is the story of the day everything went wrong, and why nobody noticed.
The Architecture
AitherVeil's production topology is three Docker containers:
demo.aitherium.com
│
Cloudflare Tunnel
│
▼
┌──────────────────┐
│ aither-veil-lb │ nginx:alpine (port 3080)
└──┬───────────┬───┘
│ │
▼ ▼
┌────────┐ ┌─────────────┐
│ Veil │ │ Veil │
│ primary│ │ standby │
│ :3000 │ │ :3000 │
└────────┘ └─────────────┘
Primary handles all traffic. Standby is a hot spare — identical image, same network, same TLS trust chain, same Ed25519 signing identity. The LB is a 120-line nginx config that makes failover automatic and invisible.
The key trick is DNS-based failover. Here's the real config from veil-lb.conf:
resolver 127.0.0.11 valid=10s ipv6=off;
location / {
set $veil_primary http://aitheros-veil:3000;
proxy_pass $veil_primary;
error_page 502 503 504 = @standby;
}
location @standby {
set $veil_standby http://aitheros-veil-standby:3000;
proxy_pass $veil_standby;
}
Three things make this work:
- `resolver 127.0.0.11 valid=10s ipv6=off` — Docker's internal DNS. Container IPs change after every `docker compose up -d`. Nginx re-resolves every 10 seconds. No manual intervention.
- `set $var` before `proxy_pass` — This forces per-request DNS resolution instead of caching the IP at config load. Without the variable, nginx resolves once at startup and sends traffic to a dead IP after a recreate.
- `error_page 502 503 504 = @standby` — If primary returns a connection error, nginx silently retries the identical request against standby. The user's browser never sees an error page.
The LB itself never restarts during a deploy. It uses depends_on: condition: service_started (not service_healthy). It comes up even if Veil is still booting. It's the anchor point that survives everything.
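A minimal compose sketch of this topology might look like the following. Service names are the ones used in this article; the image tags, port mapping, and config path are illustrative assumptions:

```yaml
# Sketch only — file paths and tags are placeholders, not the production compose file.
services:
  aither-veil-lb:
    image: nginx:alpine
    ports:
      - "3080:80"
    volumes:
      - ./veil-lb.conf:/etc/nginx/conf.d/default.conf:ro
    depends_on:
      aither-veil:
        # service_started, NOT service_healthy: the LB must come up
        # even while Veil is still booting.
        condition: service_started

  aither-veil:
    image: aitheros-veil:latest

  aither-veil-standby:
    image: aitheros-veil:latest
```

Because the LB depends only on the containers *starting*, recreating either Veil container never forces an LB restart.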
The Day Everything Went Wrong
April 10th, 2026. A user reports: "Chat says Checking..."
That "Checking..." means the frontend is polling /api/genesis/health and getting no response. Could be Genesis down. Could be the network. Could be anything.
I opened the terminal and started pulling threads. What I found was a cascading failure that had been building for hours, silently, with no alerts.
Layer 1: GPU Containers Crash-Looping
Two vLLM containers — aither-vllm (a fallback profile) and aither-vllm-coding — were crash-looping. The fallback had 101 restarts. Every time it restarted, it tried to allocate GPU memory, failed because the primary vLLM instances already held the VRAM, and crashed again. Over and over.
The coding instance was trying to download deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct on every restart. A 16GB model download attempt, 101 times.
Neither container was supposed to be running. They were leftovers from a previous configuration. Docker's restart: unless-stopped policy kept bringing them back from the dead.
Lesson: restart: unless-stopped is a loaded gun. A crashed container that shouldn't exist will consume resources forever if you don't explicitly remove it.
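One way to defuse that gun — a sketch of how the compose file *could* be restructured, not what AitherOS actually does — is to make optional GPU services opt-in via Compose profiles and cap their restarts:

```yaml
# Sketch: an opt-in GPU service that stays dead when it crashes.
services:
  aither-vllm-fallback:
    profiles: ["gpu-fallback"]   # never starts unless --profile gpu-fallback is passed
    restart: on-failure:3        # give up after 3 crashes instead of looping forever
```

A plain `docker compose up -d` skips profiled services entirely, so a leftover config can't resurrect itself, and `on-failure:3` bounds the damage if it does run.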
Layer 2: The Deleted Base Image
I tried to rebuild Pulse (the event bus and alerting service). It failed:
ERROR: failed to solve: aitheros-base:latest: not found
An earlier cleanup operation — an AI coding agent running docker system prune to free disk space — had nuked aitheros-base:latest. This is the unified base image that 90% of our Python services inherit from. Without it, nothing can rebuild.
No base image → can't rebuild Pulse → can't start the alerting system → no alerts about anything.
Lesson: Your base Docker image is infrastructure. AI agents with terminal access will happily destroy it while "cleaning up." Pin a digest. Push it to a registry. Tag it so prune won't touch it. And never give an automated tool unconstrained access to docker system prune.
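Pinning by digest can be sketched in a child service's Dockerfile. The registry and digest below are placeholders — resolve the real digest with `docker images --digests` or `docker buildx imagetools inspect`:

```dockerfile
# Sketch: pin the base by digest so a prune + re-pull can't silently change it.
# <digest-of-pinned-build> is a placeholder, not a real digest.
FROM registry.example.com/aitheros-base@sha256:<digest-of-pinned-build>
COPY . /app
```

A digest reference survives tag deletion on the local machine as long as the image lives in a registry; `docker system prune` can only cost you a re-pull, never the image itself.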
Layer 3: Alerting Can't Alert
AitherOS has a ContainerDeathWatchdog that monitors critical containers and fires email alerts when they die. It runs inside Pulse.
Pulse was down because the base image was deleted.
The system that watches for failures had itself failed, and there was nobody watching the watcher. The alerting system was the single point of failure for alerting.
Lesson: Never run your death monitor inside something that can die. Or if you do, have a second, dumber watchdog — a cron job, a systemd timer, anything — that checks whether the first watchdog is alive.
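The logic of that second, dumber watchdog is small enough to sketch. The class name is made up here, and the probe and alert transports (curl, Pushover, email) are deliberately left out:

```typescript
// Minimal outer-watchdog logic: alert exactly once after N consecutive
// failed probes, then reset when the service recovers.
class DeadMansSwitch {
  private failures = 0;

  constructor(private readonly threshold: number = 3) {}

  // Record one probe result. Returns true only at the moment the
  // failure streak first reaches the threshold — i.e. when to alert.
  record(healthy: boolean): boolean {
    this.failures = healthy ? 0 : this.failures + 1;
    return this.failures === this.threshold;
  }
}
```

A cron job or systemd timer would call `record()` with the result of an HTTP probe each minute; firing only when the streak *first* hits the threshold means one alert per outage instead of one per tick.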
Layer 4: 150 Bind Mounts and the 9P Protocol
Genesis — the kernel orchestrator, the single most important service in the stack — was accepting TCP connections but not responding to HTTP. The event loop was completely saturated.
The root cause: Docker Desktop's 9P file sharing protocol.
Every service in the compose file had two bind mounts:
volumes:
- ./AitherOS/lib:/app/lib:ro
- ./AitherOS/services:/app/services:ro
These were development conveniences — overlay the COPY'd code with the host filesystem so you can edit without rebuilding. Across all services, that's 150 bind mount lines.
On Docker Desktop for Windows, bind mounts use the 9P protocol. 9P is a network filesystem. Every file access — every import, every open(), every stat() — crosses a network boundary between the Linux VM and the Windows host. Under load, this adds milliseconds of latency to operations that should take microseconds.
Genesis runs a single uvicorn worker. Its event loop handles chat requests, agent dispatch, health checks, routine execution, and background tasks. When every import and file read is slow, the event loop can't keep up. It accepts connections (TCP) but can't process them (HTTP). From the outside, it looks like the service is up but dead.
Lesson: 9P bind mounts are poison for async Python services. COPY your code into the image. Use bind mounts only for data volumes. If you need live reload, use Docker Compose's watch feature, which rebuilds the image instead of overlaying the filesystem.
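If live reload is still wanted, Compose's `develop`/`watch` feature can replace the overlay. A sketch, with paths assumed to match the build context above:

```yaml
# Sketch: develop/watch instead of 9P bind mounts. Paths are assumptions.
services:
  genesis:
    build: ./AitherOS
    develop:
      watch:
        # On change, copy the files into the running container and restart it —
        # no host filesystem overlay, no 9P round-trip on every read.
        - action: sync+restart
          path: ./AitherOS/lib
          target: /app/lib
```

Run it with `docker compose watch`. The container still reads code from its own image layer at native speed; the host filesystem is only touched when a file actually changes.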
Layer 5: The Protocol Mismatch
After fixing all of the above — stopping the crash-loopers, rebuilding the base image, rebuilding Pulse, ripping out 150 bind mount lines, recreating Genesis — the chat still showed "Checking..."
Genesis was healthy. I could curl it directly and get 200 OK. But Veil's API routes were all timing out.
Veil's Node.js process was locked up. The logs were nothing but:
[ServiceSigner] AitherSecrets unavailable — retrying in 60s: TypeError: fetch failed
[ServiceSigner] AitherSecrets unavailable — retrying in 60s: TypeError: fetch failed
[ServiceSigner] AitherSecrets unavailable — retrying in 60s: TypeError: fetch failed
The ServiceSigner provisions Ed25519 signing keys from AitherSecrets. It was configured to connect via http://aitheros-secrets:8111. Every other service in the stack uses HTTPS with an internal CA. Secrets was rejecting the plaintext HTTP connection, the fetch was throwing TypeError: fetch failed (Node's opaque catch-all for any network-level failure — here, plaintext HTTP hitting a TLS-only port), and the retry loop was flooding the event loop.
But it wasn't just the signer. A codebase-wide grep found 37 more hardcoded http:// URLs across Veil — in API routes, middleware, audit logging, blog broadcasting, billing, every service proxy. All wrong. All should have been https://.
The fix touched 44 files. Every single hardcoded inter-service URL.
Lesson: If your services use TLS internally, enforce it everywhere. One http:// in a fetch() call can lock up an entire Node.js process when the server expects HTTPS. And use a centralized URL resolver (getServiceUrl()) instead of hardcoding URLs in route handlers.
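A resolver like that can be tiny. This sketch assumes the Secrets port from above; the Genesis port and the exact shape of the real `getServiceUrl()` are illustrative, not AitherVeil's actual code:

```typescript
// Hypothetical centralized resolver in the spirit of getServiceUrl().
// Only Secrets' port 8111 comes from the article; the rest is a placeholder.
const SERVICE_PORTS: Record<string, number> = {
  "aitheros-secrets": 8111,
  "aitheros-genesis": 8000, // assumed port, for illustration
};

function getServiceUrl(service: string, path = ""): string {
  const port = SERVICE_PORTS[service];
  if (port === undefined) {
    throw new Error(`Unknown service: ${service}`);
  }
  // Scheme is hardcoded to https:// so no route handler can reintroduce
  // the plaintext-vs-TLS mismatch by hand-building a URL.
  return `https://${service}:${port}${path}`;
}
```

With every route handler forced through one function, "grep for `http://`" becomes a lint rule instead of a 44-file incident response.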
How The Architecture Saved Us
Here's the thing: through all of this — the crash-looping GPUs, the deleted base image, the 150 bind mounts, the protocol mismatch — the load balancer never went down.
aitheros-veil-lb was up for 9+ hours straight while everything behind it burned. It's a 4MB nginx:alpine container with no dependencies on any backend service. It resolves DNS, proxies requests, and fails over. That's it.
When we finally had the fix ready, the deploy looked like this:
1. `docker compose build aither-veil` — 13 seconds (BuildKit cache hit on deps layer)
2. `docker compose up -d --no-deps aither-veil` — recreate primary
3. Standby absorbs traffic for ~12 seconds while primary boots
4. `docker compose up -d --no-deps aither-veil-standby` — recreate standby
5. Both containers healthy. Zero dropped requests.
$ curl http://localhost:3080/api/genesis/health
{"status":"healthy","service":"AitherGenesis","uptime_sec":70.85,"sync_status":"live"}
The user who reported "Checking..." refreshed their browser and it worked. No maintenance window. No "we're deploying, please wait." No incident page.
The Real Nginx Config
The production veil-lb.conf handles more than just HTTP failover:
# WebSocket connections (1-hour keepalive for chat)
location /ws/ {
set $veil_primary http://aitheros-veil:3000;
proxy_pass $veil_primary;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_read_timeout 3600s; # 1 hour
error_page 502 503 504 = @standby;
}
# Health endpoint with dedicated standby fallback
location /api/health {
set $veil_health http://aitheros-veil:3000;
proxy_pass $veil_health;
proxy_connect_timeout 2s;
error_page 502 503 504 = @health_standby;
}
# HLS video segments — immutable, year-long cache
location ~ ^/api/stream/hls/.*\.ts$ {
set $veil_hls http://aitheros-veil:3000;
proxy_pass $veil_hls;
add_header Cache-Control "public, max-age=31536000, immutable";
}
# Live HLS from MediaMTX — no cache
location /hls/live/ {
set $mediamtx http://aitheros-mediamtx:8889;
proxy_pass $mediamtx/;
add_header Cache-Control "no-cache";
}
Three things to note:
- WebSockets get their own location block with a 1-hour read timeout. Chat sessions stay alive through normal traffic spikes. Only a full container restart drops them.
- Health endpoint has its own failover chain. External monitors (Cloudflare, UptimeRobot) only report "down" when both Veil containers are dead.
- HLS segments are immutable. Once a `.ts` chunk is written, it never changes, so we cache it for a year. Playlists (`.m3u8`) are no-cache because they update as transcoding progresses.
Resource Budget
Running two copies of a Next.js app isn't free:
| Container | CPU Limit | RAM Limit | CPU Reserve | RAM Reserve |
|---|---|---|---|---|
| Primary | 2 cores | 6 GB | 0.5 cores | 1.5 GB |
| Standby | 2 cores | 6 GB | 0.5 cores | 1.5 GB |
| LB | default | default | default | default |
The standby's 0.5 CPU / 1.5 GB reservation is the insurance premium. On a workstation running an RTX 5090 with 128 GB of system RAM, 1.5 GB for a hot spare that prevents downtime is nothing.
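In compose terms, one row of the table above corresponds to a `deploy.resources` block like this (primary shown; standby uses the same values):

```yaml
# Sketch matching the budget table — docker compose applies these
# limits/reservations even outside Swarm mode.
services:
  aither-veil:
    deploy:
      resources:
        limits:
          cpus: "2"
          memory: 6G
        reservations:
          cpus: "0.5"
          memory: 1.5G
```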
Both containers share:
- The same Docker image (`aitheros-veil:latest`)
- The same signing key volume (`data/signing`)
- The same blog content volume (`content/blog`)
- The same internal CA chain (`NODE_EXTRA_CA_CERTS`)
Health Check Cascade
Each layer has tiered health checks — the standby checks faster because its job is to be ready:
| Container | Interval | Timeout | Start Period | Retries |
|---|---|---|---|---|
| LB | 10s | 5s | — | 3 |
| Primary | 60s | 15s | 90s | 5 |
| Standby | 30s | 10s | 60s | 3 |
The primary gets a 90-second start period because Next.js can take a while to compile routes on first boot. The standby gets 60 seconds. The LB gets none — it's nginx, it's ready in milliseconds.
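The standby's row from the table translates to a healthcheck block like this — the probe command itself is an assumption, only the timings come from the table:

```yaml
# Sketch: standby's healthcheck. Probe command assumed; intervals from the table.
services:
  aither-veil-standby:
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:3000/api/health"]
      interval: 30s
      timeout: 10s
      start_period: 60s
      retries: 3
```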
What We Changed After This Incident
- Ripped out all 150 bind mounts. Every `./AitherOS/lib:/app/lib:ro` and `./AitherOS/services:/app/services:ro` line — gone. Services use COPY'd code from their Docker images. The compose file is 150 lines lighter and every service starts faster.
- Rebuilt the base image and documented it. `aitheros-base:latest` now gets rebuilt in CI. It won't disappear after a prune again.
- Moved the ContainerDeathWatchdog. It still runs in Pulse, but Pulse itself is now a monitored container. Docker's own healthcheck + `restart: unless-stopped` serves as the outer watchdog. If Pulse dies and can't restart, Docker Desktop's notification system catches it.
- Fixed every hardcoded URL in Veil. 44 files, `http://` → `https://`, with a codebase-wide grep to make sure we didn't miss any. The `getServiceUrl()` function already did the right thing — it was the 37 route handlers that bypassed it.
- Added `proxy_intercept_errors on` to the main location block in the LB config. Without it, nginx returns the upstream's error body instead of triggering the `@standby` fallback.
What We'd Still Improve
- Blue/green deploys with commit-SHA tags. Right now both containers use `:latest`. We'd rather update standby first, verify it, then promote. The 15-second window of stale code is acceptable but not ideal.
- Graceful drain. We don't wait for in-flight SSE streams to finish before killing the old container. Docker's 10-second SIGTERM grace period handles most cases, but long agent chat sessions get cut.
- External dead-man's switch. A 5-line cron job on the host that curls the health endpoint and sends a Pushover notification if it fails 3 times in a row. Independent of the entire Docker stack.
The Takeaway
The interesting part of this story isn't the nginx config. You can copy that in five minutes. The interesting part is that a 210-service system experienced a five-layer cascading failure — crash-looping GPUs, a deleted base image, a dead alerting system, filesystem protocol poisoning, and a TLS mismatch — and the user saw "Checking..." for a few hours instead of a crater.
That's because the load balancer is the dumbest, most reliable thing in the stack. It has no opinions. It doesn't import anything. It doesn't connect to a database. It resolves DNS, proxies bytes, and fails over on 502. When everything behind it was on fire, it kept answering the door and politely explaining that nobody was home.
Build the dumbest possible thing at the edge. Make it the last thing that dies. Ship everything else behind it as fast as you want.
Your users won't know the difference, and that's the whole point.