How a Zombie Process Took Down Our Login Page — And How We Made It Impossible
At 2:47 PM on a Wednesday, a user reported that the login page was broken. No GitHub button. No Google SSO. No passkeys, no WebAuthn, no LinkedIn. Just a bare email/password form staring back at them. Some users got it worse — a raw "Internal server error" with no login page at all.
We run 65+ Docker containers across 12 architectural layers. The login page is the front door. It should be the last thing to break. It was the first thing we noticed.
The root cause was a zombie process in AitherSecrets — our encrypted vault service on port 8111. But the path from "vault service is undead" to "login page is broken" crossed four service boundaries, three timeout thresholds, and a port number that had been wrong for weeks. This is that story.
The symptoms
The first clue was the login page itself. Instead of the usual grid of authentication options — GitHub OAuth, Google SSO, LinkedIn, passkeys, WebAuthn, TOTP — users saw a stripped-down fallback:
{
  "password": true,
  "email_otp": true,
  "totp_2fa": false,
  "webauthn": false,
  "passkeys": false,
  "security_keys": false,
  "saml_sso": false,
  "oauth_providers": { "github": false, "google": false, "linkedin": false },
  "cloudflare_access": false
}
That's the hardcoded default. The login page renders it when it can't reach SecurityCore to ask what authentication methods are actually available. Something was preventing that call from completing.
The second clue was in the Veil container logs:
[ServiceSigner] AitherSecrets unavailable — retrying in 60s: AbortError: The operation was aborted due to timeout
[ServiceSigner] AitherSecrets unavailable — retrying in 60s: AbortError: The operation was aborted due to timeout
[ServiceSigner] AitherSecrets unavailable — retrying in 60s: AbortError: The operation was aborted due to timeout
Dozens of these. Every five seconds, another timeout. AitherSecrets wasn't responding.
The third clue was SecurityCore's health endpoint taking fourteen seconds for a check that normally returns in 20 milliseconds. SecurityCore was alive but drowning.
Finding the zombie
The Secrets container was running but its main process was dead:
USER PID %CPU %MEM VSZ RSS STAT COMMAND
root 1 0.0 0.0 2396 1536 Ss /bin/sh -c python AitherSecrets.py
root 7 0.0 0.0 0 0 Z [python] <defunct>
PID 7, status Z — a zombie. The Python process had died, but its parent (PID 1, the shell) hadn't reaped it. The process was dead but wouldn't go away. It held no file descriptors, responded to no signals, served no requests. It was just... there. Undead. Blocking the container's health check from ever passing, but not crashing hard enough for Docker to restart it.
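To make the Z state concrete, here is a minimal sketch that reproduces it on a Linux host. It assumes a /proc filesystem is available; the helper names proc_state and make_zombie are illustrative and are not part of AitherOS.

```python
import os
import time


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if the PID is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # Format is "pid (comm) state ..."; comm may contain ")", so split on the last one.
    return stat.rsplit(")", 1)[1].split()[0]


def make_zombie(timeout_s=2.0):
    """Fork a child that exits at once, and deliberately do not wait on it."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately, skipping cleanup handlers
    # Parent: poll until the kernel reports the exited child as a zombie.
    deadline = time.monotonic() + timeout_s
    while proc_state(pid) != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
    return pid
```

Calling make_zombie() and then proc_state(pid) should report Z for as long as the parent never calls wait. A single os.waitpid(pid, 0) in the parent removes the entry, which is the step the shell running as PID 1 in the Secrets container never performed.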
We couldn't even restart the container cleanly:
$ docker restart aitheros-secrets
Error response from daemon: Cannot restart container 3d6f1237f3d7:
container PID 7018 is zombie and can not be killed.
Use the --init option when creating containers to run an init inside
the container that forwards signals and reaps processes
Docker told us the fix right in the error message. We just hadn't been listening.
The cascade
Here's the chain, link by link.
Link 1: Zombie Secrets kills signedFetch
AitherOS uses Ed25519 request signing for all inter-service HTTP calls. Every service gets a keypair from AitherSecrets at boot, caches it locally, and signs outbound requests. The receiving service verifies the signature. This is our zero-trust interior — no service trusts another without cryptographic proof.
AitherVeil (our Next.js frontend) uses signedFetch — a drop-in replacement for fetch() that injects Ed25519 signing headers on every request. The problem is that signedFetch calls getKeyPair(), which checks the in-memory cache, then tries AitherSecrets if the key is stale or missing — with a five-second timeout.
When Secrets is a zombie, this call doesn't fail fast — it hangs for the full five seconds, every single time. The circuit breaker catches the failure and skips retries for 60 seconds, but by then the damage is done.
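The cache-then-fetch-then-breaker behavior described above can be sketched as follows. This is an illustration in Python, not Veil's signedFetch code: the class name, the fetch_key callable, and the 300-second key lifetime are assumptions, and only the 60-second pause comes from the logs.

```python
import time


class KeyProviderUnavailable(Exception):
    """Raised when no signing key can be obtained right now."""


class BreakerGuardedKeyCache:
    def __init__(self, fetch_key, ttl_s=300.0, cooldown_s=60.0, clock=time.monotonic):
        self._fetch_key = fetch_key    # hypothetical callable: returns a keypair or raises
        self._ttl_s = ttl_s            # assumed key lifetime, not a value from the post
        self._cooldown_s = cooldown_s  # mirrors the 60-second retry pause in the logs
        self._clock = clock
        self._key = None
        self._fetched_at = 0.0
        self._open_until = 0.0         # breaker is open while clock() < this value

    def get_key_pair(self):
        now = self._clock()
        if self._key is not None and now - self._fetched_at < self._ttl_s:
            return self._key           # fresh cached key: no network call
        if now < self._open_until:
            # Breaker open: fail immediately instead of waiting out another timeout.
            raise KeyProviderUnavailable("key provider marked down; not retrying yet")
        try:
            key = self._fetch_key()    # may block for the fetcher's own timeout
        except Exception as exc:
            self._open_until = self._clock() + self._cooldown_s
            raise KeyProviderUnavailable(str(exc)) from exc
        self._key, self._fetched_at = key, self._clock()
        return key
```

Even in this sketch, the first caller after the breaker closes still pays the full fetch timeout, which matches the observation that the breaker alone did not prevent the stall.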
Link 2: Stalled auth checks flood SecurityCore
Veil's middleware runs on every protected route. It checks the user's session by calling SecurityCore's Identity endpoint via signedFetch. Every page load, every API call, every protected route — they all hit this. With signedFetch stalling for 5 seconds per call, requests started queuing. Dozens of concurrent auth checks, each waiting 5 seconds for signing keys, then finally sending the request to SecurityCore all at once.
SecurityCore (port 8115) is a compound service — it runs Identity, Recover, and Inspector in a single container. It's built to handle normal auth traffic. It's not built to handle a thundering herd of requests that all arrive simultaneously after a 5-second synchronized stall.
Link 3: Overloaded SecurityCore drops auth/methods
The login page needs to know what authentication methods are available. These are lightweight calls that normally return in under 50ms. But SecurityCore's event loop was saturated with the auth-check flood. The login page had a 3-second timeout on these calls. Three seconds wasn't enough.
Link 4: Timeout triggers the fallback
When the auth/methods call fails, the login page falls back to hardcoded defaults. Every OAuth provider: false. Every modern auth method: disabled. The user sees a login form from 2015.
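As a rough sketch of that fallback path (in Python for illustration; the real login page lives in the Next.js frontend), the logic amounts to "try the live call, return the hardcoded default on any network failure". The function name, the fetch_methods callable, and the degraded flag are assumptions added here; the default values are the ones shown earlier.

```python
import copy

# The hardcoded default from the fallback response shown earlier.
FALLBACK_AUTH_METHODS = {
    "password": True,
    "email_otp": True,
    "totp_2fa": False,
    "webauthn": False,
    "passkeys": False,
    "security_keys": False,
    "saml_sso": False,
    "oauth_providers": {"github": False, "google": False, "linkedin": False},
    "cloudflare_access": False,
}


def load_auth_methods(fetch_methods, timeout_s=3.0):
    """Return (methods, degraded). fetch_methods is a hypothetical callable
    that takes a timeout in seconds and raises OSError on network failure."""
    try:
        return fetch_methods(timeout_s), False
    except OSError:  # TimeoutError and ConnectionError are both OSError subclasses
        # Copy so callers cannot mutate the shared default.
        return copy.deepcopy(FALLBACK_AUTH_METHODS), True
```

Returning an explicit degraded flag is one way to let the page distinguish "these methods really are disabled" from "we could not ask", a distinction the silent fallback hides.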
The full chain
AitherSecrets zombie (port 8111)
-> signedFetch stalls 5s waiting for signing keys
-> auth checks queue up in Veil middleware
-> thundering herd hits SecurityCore (port 8115)
-> SecurityCore event loop saturated
-> login page auth/methods call times out (3s)
-> fallback: OAuth/SSO buttons disappear
-> users can't log in
One zombie process. Four service boundaries. A broken login page.
The bonus bug: a phantom port
While debugging, we found something else. Several Veil source files still referenced SecurityCore on port 8117 — not 8115. Port 8117 is Flux — the event bus. It used to live inside the SecurityCore compound, but we extracted it to a standalone service weeks ago. When Flux moved out, SecurityCore stayed on 8115, but several references never got updated. Some auth calls were hitting Flux instead of Identity. Flux has no /identity/auth/me endpoint, so those calls returned connection refused or 404.
This wasn't causing the zombie cascade, but it was making things worse. Some requests failed instantly (wrong port), others stalled for 5 seconds (signedFetch timeout), and the combination made debugging significantly harder because the errors weren't consistent.
The fixes
Five changes. Each one addresses a different failure mode.
Fix 1: init: true — kill zombies at the source
This is the one-liner that would have prevented the entire incident.
x-restart-policy: &restart-policy
  restart: unless-stopped
  init: true  # tini PID 1: reaps zombie processes, forwards signals
Docker's default PID 1 behavior is a poor fit for production workloads. When you run a container, Docker makes your entrypoint command PID 1. But PID 1 in Linux has special responsibilities: it must reap zombie processes, including orphans that get reparented to it, and it must forward signals to its children. A Python script doesn't do either of these things, and in our case neither did the /bin/sh -c wrapper that was actually running as PID 1.
init: true tells Docker to inject tini as PID 1. Tini is a tiny init system (< 1MB) that does exactly two things: reap zombies and forward signals. It's Docker's built-in solution, shipping with every Docker installation since 2017. You just have to turn it on.
With init: true in our shared restart policy anchor, every container in our 65+ service fleet gets tini as PID 1. The AitherSecrets zombie would have been cleaned up in milliseconds instead of taking down the login page.
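The reaping half of an init's job is small enough to sketch. The function below is an illustration in Python of the general technique (a non-blocking waitpid loop), not tini's source, and it leaves out signal forwarding entirely; the function name is an assumption.

```python
import os


def reap_exited_children():
    """Collect every child that has already exited and return their PIDs."""
    reaped = []
    while True:
        try:
            # -1 means "any child"; WNOHANG returns at once instead of blocking.
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped
```

An init process would typically run a loop like this whenever it receives SIGCHLD, so an exited child stays in the Z state only briefly.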
Fix 2: Autoheal — self-healing containers
Even with zombie prevention, containers can go unhealthy for other reasons — memory leaks, deadlocks, corrupted state. We already had Docker healthchecks on every service. But healthchecks without auto-remediation are just monitoring. They tell you something is wrong; they don't fix it.
autoheal:
  image: willfarrell/autoheal:latest
  environment:
    AUTOHEAL_CONTAINER_LABEL: all
    AUTOHEAL_INTERVAL: 30
    AUTOHEAL_START_PERIOD: 120
    DOCKER_HOST: tcp://aitheros-docker-proxy:2375
Autoheal checks Docker health status every 30 seconds and restarts any container reported as unhealthy. It connects through our Docker socket proxy, so no container ever touches the raw Docker socket.
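With a 30-second check interval, Autoheal would have caught the unhealthy Secrets container and restarted it long before the cascade reached the login page, provided the restart itself can succeed. The docker restart error above shows that is not guaranteed without init: true, so the two fixes work as a pair.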
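The detect-and-restart loop can be sketched in a few lines. This is not Autoheal's own code; list_unhealthy and restart are hypothetical callables standing in for Docker API calls, and the two defaults mirror AUTOHEAL_INTERVAL and AUTOHEAL_START_PERIOD from the config above.

```python
import time


def run_autoheal(list_unhealthy, restart, interval_s=30, start_period_s=120,
                 sleep=time.sleep, max_cycles=None):
    """Wait out the start period, then restart unhealthy containers every interval.

    list_unhealthy() -> iterable of container names (hypothetical callable)
    restart(name)    -> restarts one container (hypothetical callable)
    max_cycles       -> stop after this many checks; None means run forever
    """
    sleep(start_period_s)  # give services time to boot before judging them
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        for name in list_unhealthy():
            restart(name)
        cycles += 1
        sleep(interval_s)
```

Passing sleep and max_cycles as parameters is only there so the loop can be exercised without real waiting.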
Fix 3: Auth methods caching
The login page was making a fresh call to SecurityCore on every render. Auth methods don't change at runtime. There's no reason to hit SecurityCore on every single login page load.
We added a 1-minute in-memory cache and increased the timeout from 3 seconds to 15 seconds. The aggressive 3-second timeout was a double-edged sword — it failed fast under normal conditions, but it also failed during exactly the scenarios where a little patience would have succeeded.
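A one-minute in-memory cache of this kind can be sketched as below. It is an illustration in Python rather than the code that runs in Veil; the class name and the loader callable are assumptions, and only the 60-second lifetime comes from the text.

```python
import time

_MISSING = object()  # sentinel, so a legitimately falsy value can still be cached


class TTLCache:
    def __init__(self, loader, ttl_s=60.0, clock=time.monotonic):
        self._loader = loader  # hypothetical callable that performs the live call
        self._ttl_s = ttl_s
        self._clock = clock
        self._value = _MISSING
        self._loaded_at = 0.0

    def get(self):
        now = self._clock()
        if self._value is not _MISSING and now - self._loaded_at < self._ttl_s:
            return self._value  # served from memory, no call to the backend
        value = self._loader()  # exceptions propagate; a failed load is never cached
        self._value, self._loaded_at = value, now
        return value
```

Because a failed load is never stored, a timeout cannot replace a good cached answer with the stripped-down defaults for a whole minute.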
Fix 4: Port correction (8117 to 8115)
SecurityCore runs on 8115. Flux runs on 8117. They were the same container once; they aren't anymore. Fixed across all compose files and Veil source.
Fix 5: Backward-compatible Identity mount
Added a dual mount in SecurityCore so Identity routes are accessible at both /auth/* (new, root-mounted) and /identity/auth/* (legacy path that Veil routes still use). This prevented 404s during the transition.
What we learned
Signed inter-service calls add a hidden dependency. Ed25519 request signing is great for zero-trust security. But it means every service has a runtime dependency on the key provider. When AitherSecrets goes down, it's not just "the vault is unavailable" — it's "every signed HTTP call in the system stalls until the timeout fires."
Docker's default PID 1 is not production-ready. Your application process is not an init system. It doesn't reap zombies. It doesn't forward signals. init: true should be the default in every docker-compose.yml you write. It's one line. It prevents an entire class of failures.
Healthchecks without auto-remediation are just monitoring. We had healthchecks on every container. They correctly detected that Secrets was unhealthy. And then nothing happened. Autoheal closes the loop: detect, restart, move on.
Aggressive timeouts are a double-edged sword. The 3-second timeout on the login page's auth/methods call was intended to fail fast. And it did — it failed fast even when a 5-second wait would have succeeded. We bumped it to 15 seconds and added caching so the timeout rarely fires at all.
Cascade failures cross layers you didn't think were connected. A vault service (Layer 0 infrastructure) took down a login page (Layer 10 UI). The path went through request signing, middleware auth checks, and API timeouts. No single team would own this entire chain.
Checklist for your own stack
If you run microservices in Docker, check these today:
- Do all your containers have init: true? If not, you're vulnerable to zombie processes.
- Do you have auto-remediation for unhealthy containers? Healthchecks without auto-restart are just expensive logging.
- Does your inter-service auth have a dependency on a single key provider? If that provider goes down, how long do your other services stall?
- Are your login page API calls cached? Auth methods don't change at runtime. Don't make a live call on every page render.
- Are your timeouts appropriate for degraded conditions? A 3-second timeout that works perfectly under normal load might be too aggressive when the system is under stress.
One zombie process. Four cascading failures. Five fixes. Zero zombies going forward.