Who Watches the Watchdog? Building Systems That Heal Themselves
A machine rebooted. When it came back, almost everything came back with it — but one small service didn't. It wasn't crashed. It wasn't erroring. It was sitting in a state that looked, to every automated safety net we had, like "nothing to see here." A bigger service depended on it, so that one quietly refused to start too. And the part of the system whose entire job is to notice this kind of thing? It was online, green, healthy — and completely silent.
No alert. No recovery. Just a slow, invisible degradation that a human happened to trip over later.
This is the most useful kind of failure, because it doesn't expose a bug in your code. It exposes a flaw in how you think about failure. So we wrote down what it taught us. None of it is exotic. All of it is the stuff that's easy to skip until it bites you.
The watchdog can't share fate with the thing it watches
Our monitor lived inside one of the core services. Convenient — it already had all the context, all the connections, all the data. Right up until that core service was the thing that went down. Then the monitor went down with it, and a monitor that's down doesn't send alerts about being down.
This is the oldest trap in operations, and it's seductive every single time: you put the alarm inside the building it's supposed to protect. The fix is boring and non-negotiable — the observer has to be able to survive the thing it observes. It runs separately. It depends on as little as possible. It has its own way out. If your health-checker and your application die in the same breath, you don't have monitoring; you have a status light that's wired to the same fuse as everything else.
Recovery only covers the failure modes you imagined
Every automatic-restart mechanism we had was built around one mental model: a thing runs, the thing crashes, you start it again. Reasonable. It's also only one way for something to be down.
The service that stranded us never crashed. It never ran in the first place — it got created and then, mid-startup, the rug got pulled. To the restart logic, "crashed" and "never started" are completely different worlds, and ours only knew how to rescue the first one. So the container sat there, technically present, practically useless, and every layer of automation looked right past it because it didn't match the shape of failure they'd been taught to expect.
The lesson: enumerate the states, not just the events. "Down" is not a single thing. Down can mean crashed, never-started, stuck-waiting-on-a-dependency, running-but-not-answering, or up-and-lying-about-being-healthy. Recovery that only handles the dramatic failure (the crash) and ignores the quiet one (the no-show) will let the quiet one take you out.
The alert has to travel a different road than the failure
Here's the subtle one. Even once we had an independent watcher, its alerts still went nowhere — because the channels it tried to use were either never configured or, worse, routed through the very infrastructure that was down. An alert that depends on the healthy system to deliver the news that the system is unhealthy is a contradiction. It only works when you don't need it.
So the rule we now hold ourselves to: the notification path must survive the outage it's reporting on. Pick a delivery route that doesn't share components with the thing most likely to fail. Have more than one. And — this is the part everyone skips — actually send a test alert and confirm it lands in a human's hands. An untested alerting path is not an alerting path. It's a hope.
Recovery has to be safe, or you'll trade an outage for a storm
There's a reason teams disable auto-restart after they get burned by it once: a naive "if it's down, start it" loop can turn a single sick service into a thrashing restart storm that's worse than the original problem. We'd been there. The reaction — turning recovery off entirely — is understandable and wrong. You don't fix a chainsaw by never using it.
Good auto-recovery has guardrails baked in:
- It's bounded. Try a few times, then stop and escalate to a human. Infinite retries are how you melt things.
- It backs off. Don't hammer. Give the thing room to actually come up.
- It has an exclusion list. Some components are too sensitive to be restarted automatically — the ones where a restart is expensive or destabilizing. Those get watched and reported, never auto-poked. Recovery is a scalpel, not a reflex.
- It's idempotent and direct. It acts on the actual current state, and it does so without routing through the dependency that might be the thing that's broken.
Recovery you can trust is recovery you've made boring — predictable, capped, and impossible to turn into a self-inflicted wound.
You don't have instrumentation until it's caught something real
This is the heart of it. It is very easy to build a beautiful dashboard full of green checkmarks and call your system "observable." Green dashboards are not evidence of observability. They're evidence that nothing has gone wrong yet, or that your instruments aren't pointed at the things that break.
The only real test is adversarial: break something on purpose and watch what your system does about it. Does it notice? Does it tell someone, through a channel that works? Does it try to fix it, safely, and know when to give up and ask for help? Until you've watched your instrumentation handle a genuine failure end to end, you don't actually know what it does — you know what you hope it does. Those are different, and the gap between them is exactly where 3 a.m. outages live.
So the procedure we settled on is less a tool and more a habit:
- Separate the observer. It survives what it watches, depends on little, and has its own escape hatch.
- Model failure as states, not events. Crashed, stuck, never-started, lying-about-health — cover all of them.
- Make the alert path independent and verified. Different road than the failure; tested with a real message to a real person.
- Make recovery safe. Bounded, backed-off, exclusion-aware, direct.
- Prove it by breaking it. Adversarial drills, not green dashboards.
The payoff
Not long after we put all of this in place, the same class of failure happened again — a core service got stranded in that exact "present but never started" state. This time the independent watcher saw it within seconds, started it directly, confirmed it came back, and recovered the whole dependency chain in about half a minute. No human. No silent hours. No one tripping over it later.
That's the whole point. The goal was never a prettier dashboard. The goal was a system that, when one of its own pieces quietly falls over at the worst possible moment, notices, tells someone, and — carefully, within limits it understands — picks the piece back up itself.
You earn that by assuming you will fail, designing the failure paths as deliberately as the happy ones, and then proving they work by breaking your own things on purpose. Reliability isn't the absence of failure. It's what your system does about failure when no one is watching.