The Login Service That Couldn't Survive Its Own Restart
It began the way most infrastructure stories begin: with a human being mildly annoyed.
You click Sign in with AitherOS on grafana.aitherium.com. You get bounced to our identity provider at idp.aitherium.com. You type your email, hit Send login code, and… nothing. The button just sits there. No spinner, no "sending," no acknowledgment that the universe registered your intent. So you do what any reasonable person does — you click it again. And again. Then your inbox lights up with six identical "Your AitherOS login code" emails, you pick one, paste it in, and the page says Invalid code. Please try again. You finally find the right code, paste that, and Grafana greets you with Login failed: Failed to get token from provider.
That was the bug report. It came with screenshots. It was, fairly, described as "a few issues."
What looked like a flaky button turned out to be the visible edge of a much deeper problem: our auth service kept all of its short-lived login state in process memory. That's fine when there's exactly one process. It is quietly catastrophic the moment you want two. This is the story of fixing the button, discovering the real disease, rebuilding the identity provider to be highly available — and the genuinely rocky deploy that followed, including the part where I made the outage worse before I made it better.
Part 1: the button really was dead (sort of)
The first symptom was the easiest to explain and the most satisfying to fix.
The IdP login page is server-rendered HTML. "Send login code" was a plain form submit, and the handler sent the email synchronously — over our Cloudflare tunnel, through the mail path — before it rendered the next page. That's one to two seconds where the button looks completely inert. There was no client-side feedback at all: no disable, no "Sending…," nothing.
So an impatient human clicks again. Every click mints a fresh one-time code and a fresh opaque token, and the page you land on is bound to the latest token. Enter a code from an earlier email and the hash doesn't match the token the page is holding → "Invalid code." Not a backend failure. Just a UI that never said it was working, feeding a flood it then couldn't reconcile.
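The mismatch can be reproduced in a few lines. This is a minimal sketch, not the service's code: `LoginPage`, the six-digit code format, and the use of SHA-256 are all assumptions made for illustration. The only behavior taken from the description above is that each click mints a new code and token, and the page verifies against the latest token only.

```python
import hashlib
import secrets
from typing import Dict, Optional


class LoginPage:
    """Hypothetical model of the flow: every submit replaces the pending login."""

    def __init__(self) -> None:
        self.token: Optional[str] = None          # opaque token the page is bound to
        self._code_hashes: Dict[str, str] = {}    # token -> sha256(code)

    def send_login_code(self) -> str:
        """Mint a fresh code and token; return the code that would be emailed."""
        code = f"{secrets.randbelow(10**6):06d}"
        self.token = secrets.token_urlsafe(16)
        self._code_hashes[self.token] = hashlib.sha256(code.encode()).hexdigest()
        return code

    def verify(self, code: str) -> bool:
        """Check the code against the token the page currently holds."""
        expected = self._code_hashes.get(self.token or "")
        return expected == hashlib.sha256(code.encode()).hexdigest()


page = LoginPage()
emailed = [page.send_login_code() for _ in range(6)]   # six impatient clicks
print("latest code accepted:", page.verify(emailed[-1]))
```

Any earlier entry in `emailed` fails `verify` unless it happens to equal the last code, which is what the user saw as "Invalid code."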
The fix was a few lines of inline JavaScript: on submit, disable the button, show "Sending code…", and guard against double submission. Same treatment for the "Verify" button so a double-click can't burn an attempt. We shipped it, restarted the auth service, confirmed the new markup was live, and the code-spam cascade was gone.
If the story ended there it wouldn't be worth writing about. But "Failed to get token from provider" was still lurking, and it pointed somewhere far more interesting.
Part 2: the disease, not the symptom
Our identity provider — internally aitheros-security-core, which serves every /identity/* and /auth/* route — was running as a single container with a single worker. The public hostname routed straight to it. There was no standby and no load balancer, unlike our other critical services.
That single-process design wasn't an accident; it was load-bearing. We went looking for why it could only run one worker, and the answer was everywhere: the service kept its ephemeral auth state in in-memory dictionaries.
A quick inventory turned up around seventeen of them:
- OIDC authorization codes and access tokens
- OIDC browser-flow state (the thing that ties an authorize request to the eventual code)
- Email one-time codes, across three different login surfaces
- SSH login challenges, 2FA step-up challenges, device-code flow
- Magic-link tokens, WebAuthn/passkey ceremony state, SAML request and IdP-flow state
- And every rate limiter — login attempts, TOTP attempts, OTP send throttles
Every one of those is a promise that "the node that issued this is the same node that will redeem it." Run two replicas behind a load balancer and that promise breaks on the first request. Issue an OIDC code on replica A, send the token exchange to replica B, and B has never heard of it → Failed to get token from provider. Request an OTP on A, verify on B → Invalid code. The exact failures from the bug report, now with a mechanism.
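The broken promise is easy to demonstrate with two objects that each keep their own dictionary. This is an illustrative sketch; `InMemoryReplica` and its method names are made up here and are not the service's classes.

```python
import secrets
from typing import Dict


class InMemoryReplica:
    """Stand-in for one auth process that keeps OIDC codes in a local dict."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._codes: Dict[str, str] = {}   # authorization code -> user

    def issue_code(self, user: str) -> str:
        code = secrets.token_urlsafe(8)
        self._codes[code] = user
        return code

    def exchange_code(self, code: str) -> str:
        # pop() makes the code single-use, but only on this replica
        user = self._codes.pop(code, None)
        if user is None:
            raise LookupError(f"replica {self.name}: unknown authorization code")
        return user


replica_a = InMemoryReplica("A")
replica_b = InMemoryReplica("B")

code = replica_a.issue_code("alice@example.com")
try:
    replica_b.exchange_code(code)   # the load balancer picked the other replica
except LookupError as exc:
    print("token exchange failed:", exc)

print("same replica works:", replica_a.exchange_code(code))
```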
The rate limiters were their own quiet scandal: with state per-process, two replicas means a brute-force attacker gets twice the attempts before lockout. Each replica you add makes your throttle weaker.
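A small simulation makes the arithmetic concrete. It assumes round-robin routing and a limit of five attempts; both are hypothetical values chosen for the example, and `LocalLimiter` is a stand-in rather than the real limiter.

```python
import itertools
from typing import Dict


class LocalLimiter:
    """Per-process attempt counter: each replica only sees its own traffic."""

    def __init__(self, max_attempts: int) -> None:
        self.max_attempts = max_attempts
        self._attempts: Dict[str, int] = {}

    def allow(self, account: str) -> bool:
        count = self._attempts.get(account, 0)
        if count >= self.max_attempts:
            return False              # locked out on this replica only
        self._attempts[account] = count + 1
        return True


def guesses_before_lockout(replicas: int, max_attempts: int) -> int:
    """Count the guesses an attacker gets when requests rotate across replicas."""
    limiters = [LocalLimiter(max_attempts) for _ in range(replicas)]
    allowed = 0
    blocked_in_a_row = 0
    for limiter in itertools.cycle(limiters):
        if limiter.allow("victim@example.com"):
            allowed += 1
            blocked_in_a_row = 0
        else:
            blocked_in_a_row += 1
            if blocked_in_a_row == replicas:   # every replica now refuses
                return allowed
    return allowed  # unreachable: cycle() never ends


for n in (1, 2, 3):
    print(n, "replica(s):", guesses_before_lockout(n, max_attempts=5), "guesses")
```

With a per-process counter, the effective limit is the configured limit multiplied by the replica count.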
So this was never a one-file patch. It was an architecture project: externalize the state, then HA becomes possible.
Part 3: the design
We had a strong constraint going in: reuse what already exists, don't bolt on new infrastructure. The stack already had a Redis instance (aitheros-redis), already had a shared SQLite directory store on a bind-mount that every replica could see, and already had a battle-tested HA pattern for two other services (an active-active pair behind an nginx load balancer). We leaned on all three.
Short-TTL, high-churn state → Redis. A new AuthStateStore wraps redis.asyncio with exactly the semantics this domain needs: a namespaced TTL key/value, an atomic read-and-delete (GETDEL) so a single-use code can't be redeemed twice even if two replicas race, and a global fixed-window counter for rate limits. OIDC codes/tokens/flows, email OTP, and the device-code flow moved here. Critically, when the store is required (the HA replicas set a flag) and Redis is unreachable, the node reports itself degraded rather than silently falling back to per-process memory — because a silent fallback is exactly how you reintroduce the bug you're trying to kill.
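The three semantics described above might look roughly like the sketch below. Only the class name `AuthStateStore` comes from this post; the method names, the key layout, `StoreDegraded`, and the `backend_errors` parameter are assumptions. The client is expected to behave like `redis.asyncio.Redis`, whose `set(..., ex=)`, `getdel`, `incr`, and `expire` calls are used here (GETDEL needs Redis 6.2 or newer). The sketch covers only the "required" mode, so there is deliberately no in-memory fallback.

```python
import time
from typing import Any, Optional, Tuple, Type


class StoreDegraded(RuntimeError):
    """The shared backend is unreachable; the node should report itself degraded."""


class AuthStateStore:
    def __init__(
        self,
        client: Any,   # e.g. redis.asyncio.Redis(decode_responses=True)
        namespace: str = "auth",
        # With redis-py, pass (redis.exceptions.ConnectionError,
        # redis.exceptions.TimeoutError) here instead of the builtins.
        backend_errors: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    ) -> None:
        self._r = client
        self._ns = namespace
        self._errors = backend_errors
        self.degraded = False

    def _key(self, *parts: str) -> str:
        return ":".join((self._ns, *parts))

    async def _call(self, awaitable: Any) -> Any:
        try:
            result = await awaitable
        except self._errors as exc:
            self.degraded = True   # surfaced to health checks; no local fallback
            raise StoreDegraded(str(exc)) from exc
        self.degraded = False
        return result

    async def put(self, kind: str, ident: str, value: str, ttl_s: int) -> None:
        """Store a value that expires after ttl_s seconds."""
        await self._call(self._r.set(self._key(kind, ident), value, ex=ttl_s))

    async def take(self, kind: str, ident: str) -> Optional[str]:
        """Atomic read-and-delete: only one caller ever gets the value."""
        return await self._call(self._r.getdel(self._key(kind, ident)))

    async def hit(self, kind: str, ident: str, window_s: int,
                  now: Optional[float] = None) -> int:
        """Fixed-window counter shared by every replica; returns the new count."""
        now = time.time() if now is None else now
        bucket = str(int(now) // window_s)
        key = self._key("rl", kind, ident, bucket)
        count = await self._call(self._r.incr(key))
        if count == 1:
            # First hit in this window: let the bucket expire on its own.
            await self._call(self._r.expire(key, window_s))
        return count
```

In this sketch INCR and EXPIRE are two separate round trips, so a crash between them would leave one bucket key without a TTL; a pipeline or a Lua script would close that gap.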
Sync-path state → the shared directory store. A lot of the auth state lives in synchronous code — the hot authenticate() path, the provider classes. Routing those through an async Redis client would have forced an async ripple through the whole call graph. Instead we added a couple of small tables (and one generic namespaced TTL key/value table) to the existing SQLite directory store, which is already shared across replicas via a bind-mount and already synchronous. SSH challenges, 2FA challenges, the TOTP limiter, magic links, WebAuthn ceremonies, and SAML flows all went there. No async ripple, full cross-replica sharing.
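A generic namespaced TTL table of the kind described above could be as small as the following. The table name, column names, and `SqliteTtlStore` class are invented for this sketch; the real schema is not shown in this post. It uses only the standard `sqlite3` module and stays fully synchronous.

```python
import sqlite3
import time
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS auth_kv (
    namespace  TEXT NOT NULL,
    key        TEXT NOT NULL,
    value      TEXT NOT NULL,
    expires_at REAL NOT NULL,
    PRIMARY KEY (namespace, key)
)
"""


class SqliteTtlStore:
    def __init__(self, path: str) -> None:
        # isolation_level=None puts the connection in autocommit mode, so the
        # explicit BEGIN/COMMIT statements below control the transactions.
        self._db = sqlite3.connect(path, isolation_level=None)
        self._db.execute(SCHEMA)

    def put(self, namespace: str, key: str, value: str, ttl_s: float,
            now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._db.execute(
            "INSERT OR REPLACE INTO auth_kv VALUES (?, ?, ?, ?)",
            (namespace, key, value, now + ttl_s),
        )

    def take(self, namespace: str, key: str,
             now: Optional[float] = None) -> Optional[str]:
        """Single-use read: fetch and delete the row inside one transaction."""
        now = time.time() if now is None else now
        self._db.execute("BEGIN IMMEDIATE")   # take the write lock before reading
        try:
            row = self._db.execute(
                "SELECT value, expires_at FROM auth_kv WHERE namespace=? AND key=?",
                (namespace, key),
            ).fetchone()
            self._db.execute(
                "DELETE FROM auth_kv WHERE namespace=? AND key=?", (namespace, key)
            )
            self._db.execute("COMMIT")
        except Exception:
            self._db.execute("ROLLBACK")
            raise
        if row is None or row[1] <= now:
            return None   # missing or expired
        return row[0]
```

`BEGIN IMMEDIATE` is what keeps two replicas from both redeeming the same challenge, and it relies on SQLite's file locking working correctly on the shared mount.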
Background loops → leader election. Two replicas running the same code means two reminder schedulers and two backup loops. A small LeaderLock — a renewable Redis lease — makes exactly one replica the leader. The reminder scheduler and the backup loops check is_leader before doing work. It's fail-closed: no lock, no run, so you never double-send.
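One common way to build a renewable lease on Redis is `SET key token NX EX ttl` to acquire and a small Lua compare-and-expire script to renew. The sketch below follows that pattern; it is an assumption about how such a lock can work, not the actual `LeaderLock` source, and the `tick` method and 30-second TTL are made up. The client is expected to behave like `redis.asyncio.Redis`, using its `set(..., nx=True, ex=)` and `eval(script, numkeys, *keys_and_args)` calls.

```python
import secrets
from typing import Any

# Extend the lease only if this instance still owns it (compare-and-expire).
RENEW_SCRIPT = """
if redis.call('GET', KEYS[1]) == ARGV[1] then
    return redis.call('EXPIRE', KEYS[1], ARGV[2])
end
return 0
"""


class LeaderLock:
    def __init__(self, client: Any, key: str, ttl_s: int = 30) -> None:
        self._r = client
        self._key = key
        self._ttl_s = ttl_s
        self._token = secrets.token_hex(16)   # identifies this replica's lease
        self.is_leader = False

    async def tick(self) -> bool:
        """Acquire or renew the lease; call on a timer shorter than ttl_s."""
        try:
            if self.is_leader:
                renewed = await self._r.eval(
                    RENEW_SCRIPT, 1, self._key, self._token, self._ttl_s
                )
                self.is_leader = renewed == 1
            if not self.is_leader:
                acquired = await self._r.set(
                    self._key, self._token, nx=True, ex=self._ttl_s
                )
                self.is_leader = bool(acquired)
        except Exception:
            # Fail closed: if the lock backend misbehaves, nobody runs the job.
            self.is_leader = False
        return self.is_leader
```

A background loop would call `await lock.tick()` and skip its work whenever the result is false. Note that `is_leader` is only as fresh as the last tick, so the tick interval should be well under the lease TTL.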
The topology. A standby replica and a TLS-terminating nginx load balancer, active-active, 50/50, with fast failover to the peer. We moved the service's network aliases onto the load balancer so every existing consumer — internal services, Grafana, the public tunnel — flowed through HA with zero configuration changes on their side.
And Redis itself. A single Redis is a single point of failure, and we'd just made it load-bearing for auth. So we finished the Sentinel setup that was half-present: added the replica that Sentinel needs to actually have something to promote, put Sentinel and the replica in the running profiles, and taught AuthStateStore and LeaderLock to discover the master through Sentinel and follow failover.
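With redis-py, Sentinel discovery is usually done through `redis.asyncio.sentinel.Sentinel` and its `master_for` method. The sketch below shows that shape under assumptions: the helper names, the `host:port` list format, and the timeout values are illustrative, and how the real `redis_factory` is written is not shown in this post.

```python
from typing import Any, List, Tuple


def parse_sentinels(spec: str) -> List[Tuple[str, int]]:
    """Turn 'host1:26379,host2:26379' into [('host1', 26379), ('host2', 26379)]."""
    pairs = []
    for item in spec.split(","):
        host, _, port = item.strip().rpartition(":")
        pairs.append((host, int(port)))
    return pairs


def master_client(sentinel_spec: str, service_name: str) -> Any:
    """Return a client that asks Sentinel for the current master on connect."""
    # Imported lazily so this module loads even where redis-py is not installed.
    from redis.asyncio.sentinel import Sentinel

    sentinel = Sentinel(parse_sentinels(sentinel_spec), socket_timeout=0.5)
    # master_for() returns a Redis client backed by a Sentinel-managed pool, so
    # connections opened after a failover go to the newly promoted master.
    return sentinel.master_for(service_name, socket_timeout=0.5)
```

A store or lock built on the returned client would not need to know which node is currently the master. Commands in flight during a failover can still fail, so callers need to treat those errors as a degraded state rather than retrying blindly.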
By the time the code was done, 139 tests were green — including one that splits six failed login attempts across two scorer instances and asserts both of them see all six. (That test exists because the first time through, I left the velocity counter — a brute-force control — per-replica and waved it off as "eventually consistent." That was wrong, a reviewer said so in plain terms, and the fix was to do it properly and prove it.)
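The shape of that cross-instance test is roughly as follows. `VelocityScorer` and `SharedCounter` are hypothetical stand-ins written for this sketch; the real scorer and its Redis-backed counter are not reproduced here.

```python
from typing import Dict


class SharedCounter:
    """Test double for the shared store's global counter (Redis in production)."""

    def __init__(self) -> None:
        self._counts: Dict[str, int] = {}

    def incr(self, key: str) -> int:
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key]

    def get(self, key: str) -> int:
        return self._counts.get(key, 0)


class VelocityScorer:
    """Hypothetical scorer that counts failed logins per account."""

    def __init__(self, counter: SharedCounter) -> None:
        self._counter = counter

    def record_failure(self, account: str) -> int:
        return self._counter.incr(f"velocity:{account}")

    def failures(self, account: str) -> int:
        return self._counter.get(f"velocity:{account}")


def test_failures_are_visible_across_replicas() -> None:
    shared = SharedCounter()
    scorer_a = VelocityScorer(shared)
    scorer_b = VelocityScorer(shared)

    # Six failed attempts, alternating between the two "replicas".
    for i in range(6):
        scorer = scorer_a if i % 2 == 0 else scorer_b
        scorer.record_failure("alice@example.com")

    # Both instances must see the full total, not just their own three.
    assert scorer_a.failures("alice@example.com") == 6
    assert scorer_b.failures("alice@example.com") == 6


test_failures_are_visible_across_replicas()
```

Give each scorer its own `SharedCounter` and the assertions fail with a count of three, which is the per-replica mistake the test was written to catch.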
Part 4: the deploy, in which I am humbled
Here is the part the tidy architecture diagrams leave out.
I went to deploy and found aitheros-security-core was already unhealthy — every /identity/* route 404ing — before I'd touched production. I did the obvious thing: a clean stop-and-start. And the auth service went from "degraded" to "completely down."
It took an embarrassing amount of careful log-reading to understand why, and the answer is a great Docker gotcha. The container had its AitherIdentity.py bind-mounted live from the repo — so it was running my updated code — but the image it ran inside was stale, built before any of my work. My updated file imports the new modules (AuthStateStore, LeaderLock, redis_factory) at the top. Those modules existed on disk and in my tests, but they were not baked into that old image. So the import failed, the Identity sub-app never mounted, and every route under it 404'd. The bind-mount let new code meet an old image, and the mismatch was invisible until exactly the wrong moment. The only fix was to rebuild the image — no amount of restarting can conjure a module that isn't there.
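One way to turn this class of mismatch into a loud startup failure instead of silent 404s is a preflight check that runs before the app mounts anything. This is a sketch of that idea, not something the service is described as doing; the function names are hypothetical, and the module names in the usage note are taken from the paragraph above without knowing their real import paths.

```python
import importlib.util
import sys
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the module names that cannot be found on the import path."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False   # parent package missing, or a malformed module spec
        if not found:
            missing.append(name)
    return missing


def preflight(names: Iterable[str]) -> None:
    """Exit with a clear message if the image lacks modules the code imports."""
    missing = missing_modules(names)
    if missing:
        sys.exit(f"image is missing modules: {', '.join(missing)}; rebuild the image")
```

Calling something like `preflight(["AuthStateStore", "LeaderLock", "redis_factory"])` at process start would stop the container with an explicit message, which is far easier to diagnose than a sub-app that quietly never mounts.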
Then, while trying to recover, my stop/start left the fresh container with no network attachment at all. I chased that as a WSL2 Docker wedge. The logs even showed a seven-hour clock skew, which turned out to be a logger artifact rather than a real clock problem: a red herring that cost me a good twenty minutes. The real cause was more embarrassing. When I'd written the compose file, I gave the primary and standby debug host ports 8116 and 8117, ports that aitheros-memory-core and aitheros-flux already owned. The port collision is what blocked the container from binding and attaching its network. I'd introduced the very wedge I was diagnosing. The fix was to delete those debug ports entirely; only the load balancer needs to publish one.
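A collision like this is cheap to catch before deploy with a check over the host ports each service publishes. The sketch below works on a plain dictionary rather than parsing a compose file. The standby's service name and the pairing of which existing service owned which port are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List


def host_port_collisions(services: Dict[str, List[int]]) -> Dict[int, List[str]]:
    """Map each host port claimed by more than one service to its claimants."""
    owners: Dict[int, List[str]] = defaultdict(list)
    for service, ports in services.items():
        for port in ports:
            owners[port].append(service)
    return {port: names for port, names in owners.items() if len(names) > 1}


published = {
    "aitheros-memory-core": [8116],            # pairing with 8116 is assumed
    "aitheros-flux": [8117],                   # pairing with 8117 is assumed
    "aitheros-security-core": [8116],          # debug port added for the primary
    "aitheros-security-core-standby": [8117],  # hypothetical standby name
}

for port, names in host_port_collisions(published).items():
    print(f"host port {port} is claimed by: {', '.join(names)}")
```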
I rebuilt the image, removed the bad ports, force-recreated the security group — primary, load balancer, Redis replica — and the primary came up healthy on the new image. Then the load balancer crash-looped. nginx, very precisely, told me why: a named location with a static proxy_pass that includes a URI path is a configuration error. My health-check failover block had exactly that. I deleted the special-case block (the health route flows through the main location just fine, with the same failover), restarted the LB, and nginx came up clean.
Three real bugs — a stale-image/bind-mount mismatch, a port collision, an nginx config error — each found the hard way, in production, with auth down. None of them were in the 139 tests, because none of them were code. They were the gap between "the code is correct" and "the system is running."
Part 5: green
Once those three were fixed, everything fell into place, and we got to do the verification that actually matters — not "does it return 200," but "is the new machinery really doing what we claimed":
- I triggered an OIDC authorize through the load balancer and then looked inside Redis. There it was: auth:oidc:flow:2t2MP1…. The flow state was in the shared store, redeemable from either replica. The original bug class was dead.
- Both replicas served the same JWKS signing key (kid=69ec980e…), so a token minted on one validates on the other.
- The Redis replica reported master_link_status:up, and Sentinel reported num-slaves:1: real failover, not a decoration.
- LeaderLock had elected a single leader; the lease sat in Redis as auth:leader:security-singletons.
- And https://idp.aitherium.com/identity/auth/methods, the full public path through Cloudflare → tunnel → load balancer → a healthy replica, returned 200.
We finished the long tail too: synced the tunnel route so the public hostname points at the load balancer, recreated Grafana so its token exchange goes through HA, and rebuilt the watchdog that guards the stack so it knows these two replicas are a pair and will never take both down at once.
What I'd tell you to take from this
In-memory state is a single-replica decision, even when you didn't decide it. Every self._pending = {} is a quiet vote against horizontal scaling. They're invisible until the day you want a second process, and then they all come due at once.
A degraded fallback that "still works" is how outages hide. The most important design choice in the whole project was making the store report degraded when its shared backend is gone, instead of silently reverting to per-process memory. Silent fallbacks don't prevent the failure; they postpone it to a worse moment and remove the signal you'd have used to catch it.
Tests prove the code; they say nothing about the system. 139 green tests and three production-only bugs is not a contradiction — it's the normal shape of things. A stale image, a port someone else owned, an nginx directive — none of those live in your unit tests, and all of them will stop your service cold.
Write down the gotchas while they're fresh. The stale-image-versus-bind-mount trap, the port collision, the named-location rule — they're in our engineering memory now, the same place this post came from, so the next person (possibly me, in three weeks) doesn't rediscover them at 3 a.m. with auth down.
The login flow that couldn't survive its own restart now survives the loss of a whole replica, and the loss of the Redis master, without a human noticing. It took a dead-looking button to find the thread, and a genuinely humbling afternoon to pull it. Both were worth it.