Self-Healing Docker: One Command to Un-Wedge the WSL2 Engine
If you run a serious Docker workload on Windows — especially GPU containers — you've met this failure:
- docker commands hang, then spit back 500 Internal Server Error, or
- docker stop / recreate dies with "tried to kill container, but did not receive an exit event", or
- the named pipe just isn't there:
npipe:////./pipe/dockerDesktopLinuxEngine ... is the daemon running?
The maddening part: Docker Desktop's GUI is still up. Green whale, looks healthy. But the Linux engine inside the WSL2 VM has wedged, and nothing you do from the CLI reaches it. For us this is worse than an annoyance — when an agent is autonomously running redeploy-sync.sh and the engine wedges mid-recreate, the whole deploy stalls. So we built a one-command fix and turned it into an agent skill the system can call on itself.
Why it wedges
Docker Desktop on Windows doesn't run the daemon natively — it runs the Linux engine inside a WSL2 virtual machine (the docker-desktop distro). Your docker CLI talks to it over a Windows named pipe that forwards into that VM.
When the VM or the pipe forwarder gets into a bad state, the CLI's requests reach a half-dead engine that accepts the connection but never completes — hence the 500. The GUI is a separate Windows process, so it keeps reporting "running" while the thing underneath is gone.
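Because the GUI and the engine fail independently, a wedge can be told apart from a plain "Docker is off" by checking both signals. Here is a minimal sketch of that check, not code from Recover-Docker.ps1. The function names Get-DockerState and Test-DockerEngine and the 15-second timeout are hypothetical, and it assumes the docker CLI is on PATH and that docker info exits non-zero when the server side fails.

```powershell
# Sketch only: names and the timeout are illustrative assumptions.

# Pure classification: GUI state and engine state are independent signals.
function Get-DockerState {
    param([bool]$GuiUp, [bool]$EngineOk)
    if ($EngineOk) { return 'healthy' }
    if ($GuiUp)    { return 'wedged' }   # whale is up, engine is not answering
    return 'down'
}

# Time-boxed 'docker info' so a hung request cannot block the caller.
function Test-DockerEngine {
    param([int]$TimeoutSeconds = 15)
    $out = [IO.Path]::GetTempFileName()
    $err = [IO.Path]::GetTempFileName()
    try {
        $probe = Start-Process -FilePath 'docker' -ArgumentList 'info' -NoNewWindow -PassThru `
            -RedirectStandardOutput $out -RedirectStandardError $err
        $null = $probe.Handle            # cache the handle so ExitCode is populated later
        if (-not $probe.WaitForExit($TimeoutSeconds * 1000)) {
            Stop-Process -Id $probe.Id -Force -ErrorAction SilentlyContinue
            return $false                # hung request: treat as unhealthy
        }
        return ($probe.ExitCode -eq 0)
    }
    finally {
        Remove-Item $out, $err -ErrorAction SilentlyContinue
    }
}

# Usage:
#   $guiUp = [bool](Get-Process -Name 'Docker Desktop' -ErrorAction SilentlyContinue)
#   Get-DockerState -GuiUp $guiUp -EngineOk (Test-DockerEngine)
```

Running the probe in a child process with a timeout matters here: a wedged engine can hang the CLI, and a health check that hangs along with it tells you nothing.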
We hit it hardest on nvidia-runtime / GPU containers (our aitheros-genesis orchestrator pins ~31 GB of VRAM). Tearing those down involves the GPU runtime shim, and when that hangs you get the classic "did not receive an exit event" — the container can't be killed, the recreate blocks, and the deploy is stuck. A docker restart won't fix it because the daemon itself is the casualty.
The only reliable cures used to be: reboot, or click through Docker Desktop's "Restart" / "Reset" and pray. Both are slow and destroy your container state.
The fix: kill it all the way down, in order
The recovery isn't magic — it's doing the teardown completely and in the right order, which the GUI doesn't. Our Recover-Docker.ps1 does five phases:
[1/5] Kill Docker Desktop UI + backends
Docker Desktop, com.docker.backend, com.docker.build,
docker-agent, docker-sandbox
[2/5] wsl --shutdown # kill the hung Linux VM outright
[3/5] taskkill vmmem, wslservice # reap the zombie VM processes
[4/5] restart Windows services # com.docker.service + wslservice (needs admin)
[5/5] start Docker Desktop # clean boot of the whole stack
Then it waits up to 90s polling docker info for a non-500 response, and once the engine answers it does the cleanup a fresh start won't:
# remove containers stuck in 'dead' (the un-killable ones)
docker ps -a --filter "status=dead" --format "{{.Names}}" | ForEach-Object { docker rm -f $_ }
# bring back what the crash took down (note: this starts every exited container, including ones stopped on purpose)
docker ps -a --filter "status=exited" --format "{{.Names}}" | ForEach-Object { docker start $_ }
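The wait step before that cleanup is a plain poll-until-deadline loop. Here is a minimal sketch of the pattern. The 90-second budget comes from the description above, while the helper name Wait-Until and the 3-second interval are assumptions. The probe is passed in as a script block, so the loop itself does not depend on Docker.

```powershell
# Sketch: poll a health probe until it succeeds or the deadline passes.
# Wait-Until and the 3 s interval are illustrative assumptions.
function Wait-Until {
    param(
        [Parameter(Mandatory)][scriptblock]$Probe,   # must return $true when healthy
        [int]$TimeoutSeconds = 90,
        [int]$IntervalSeconds = 3
    )
    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    do {
        if (& $Probe) { return $true }
        Start-Sleep -Seconds $IntervalSeconds
    } while ((Get-Date) -lt $deadline)
    return $false
}

# Usage (assumes the docker CLI is on PATH):
#   $engineUp = Wait-Until -Probe { docker info *> $null; $LASTEXITCODE -eq 0 }
#   if (-not $engineUp) { Write-Warning 'Engine did not answer within the timeout' }
```

This version does not time-box the probe itself, so a docker info call that hangs would stall the loop past the deadline. A real script would likely want a per-call timeout as well.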
The ordering is the whole trick: kill the UI processes first (so they don't fight the restart), then wsl --shutdown to nuke the wedged VM, then reap the vmmem/wslservice zombies, then bounce the Windows services, then cold-start Docker Desktop. Skip a step or do it out of order and the engine comes back into the same bad state. Done right, you're healthy in ~30–60 seconds — no reboot, container volumes intact.
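For orientation, the five phases map onto standard PowerShell cmdlets roughly as follows. This is a sketch of the ordering, not the contents of Recover-Docker.ps1. The function name Invoke-DockerRecoverySketch is hypothetical, the install path is Docker Desktop's default and is assumed, and the process and service names are the ones listed above (they can differ between Windows and WSL versions).

```powershell
# Sketch of the five-phase order. Illustrative only; not the real Recover-Docker.ps1.
function Invoke-DockerRecoverySketch {
    # Assumption: Docker Desktop lives at its default install path.
    $dockerExe = Join-Path $env:ProgramFiles 'Docker\Docker\Docker Desktop.exe'

    # [1/5] Kill the UI and backends first so they cannot fight the restart.
    $names = 'Docker Desktop', 'com.docker.backend', 'com.docker.build',
             'docker-agent', 'docker-sandbox'
    Get-Process -Name $names -ErrorAction SilentlyContinue | Stop-Process -Force

    # [2/5] Shut down the WSL2 VM. Note: this stops ALL WSL distros, not just docker-desktop.
    wsl --shutdown

    # [3/5] Reap leftover VM processes (may be denied without elevation, so errors are ignored).
    Get-Process -Name 'vmmem', 'wslservice' -ErrorAction SilentlyContinue |
        Stop-Process -Force -ErrorAction SilentlyContinue

    # [4/5] Bounce the Windows services. Needs an elevated shell; failures are swallowed here.
    Restart-Service -Name 'com.docker.service', 'wslservice' -Force -ErrorAction SilentlyContinue

    # [5/5] Cold-start Docker Desktop.
    Start-Process -FilePath $dockerExe
}
```

The sketch is wrapped in a function so that loading it has no side effects. Nothing is killed until Invoke-DockerRecoverySketch is called explicitly.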
One gotcha: phase 4 (restarting com.docker.service / wslservice) needs an elevated shell. If the script logs "Skipping service restart (not admin)" and Docker is still sick, re-run it from an admin PowerShell. Phases 1–3 + 5 often suffice on their own, but the service bounce is the belt-and-suspenders step.
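An elevation check of this kind is usually a few lines of .NET interop. Here is a minimal sketch, in which Test-IsAdmin and Restart-DockerServices are hypothetical names and the warning text mirrors the log message quoted above:

```powershell
# Sketch: detect elevation before attempting the service restart.
function Test-IsAdmin {
    $identity  = [Security.Principal.WindowsIdentity]::GetCurrent()
    $principal = [Security.Principal.WindowsPrincipal]::new($identity)
    return $principal.IsInRole([Security.Principal.WindowsBuiltInRole]::Administrator)
}

function Restart-DockerServices {
    if (-not (Test-IsAdmin)) {
        Write-Warning 'Skipping service restart (not admin)'
        return $false
    }
    Restart-Service -Name 'com.docker.service', 'wslservice' -Force
    return $true
}

# To re-run a script elevated (the path is a placeholder, use your own):
#   Start-Process powershell -Verb RunAs -ArgumentList '-NoProfile', '-File', 'C:\path\to\Recover-Docker.ps1'
```

The relaunch example uses a full path because an elevated process typically starts in a different working directory, which breaks relative paths.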
Make the machine do it: the agent skill
A script is good. A script your system invokes on itself is better. We wrapped it as a Claude Code slash command, /recover-docker:
/recover-docker # one-shot: detect → recover → verify → report
/recover-docker --monitor # watch loop: check every 30s, auto-recover on failure
The skill's whole job is recovery and nothing else: run the script, confirm with docker version + docker ps, report which containers came back, and note that a pending redeploy-sync.sh will now recreate cleanly — without rebuilding anything unless asked. The --monitor mode turns Recover-Docker.ps1 into a 30-second health watchdog that auto-recovers, so a wedge at 2 a.m. heals itself before it blocks the morning's deploys.
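A watchdog like that reduces to a probe, a sleep, and a recovery call. The sketch below shows the loop shape under stated assumptions: Start-DockerWatchdog and its parameters are hypothetical and are not the skill's actual implementation. The probe and recovery actions are injected as script blocks, and -MaxChecks exists only so the loop can be bounded for a dry run.

```powershell
# Sketch of a watchdog loop: probe on a fixed interval, recover on failure.
# Start-DockerWatchdog and all parameter names are illustrative assumptions.
function Start-DockerWatchdog {
    param(
        [Parameter(Mandatory)][scriptblock]$Probe,     # returns $true when the engine is healthy
        [Parameter(Mandatory)][scriptblock]$Recover,   # e.g. { & 'C:\path\to\Recover-Docker.ps1' }
        [int]$IntervalSeconds = 30,
        [int]$MaxChecks = 0                            # 0 = run until interrupted
    )
    $checks = 0
    $recoveries = 0
    while ($MaxChecks -eq 0 -or $checks -lt $MaxChecks) {
        $checks++
        if (-not (& $Probe)) {
            Write-Host "$(Get-Date -Format s) engine unhealthy, running recovery"
            & $Recover | Out-Null
            $recoveries++
        }
        # Sleep between checks, but not after the final one.
        if ($MaxChecks -eq 0 -or $checks -lt $MaxChecks) {
            Start-Sleep -Seconds $IntervalSeconds
        }
    }
    return $recoveries   # number of recoveries triggered
}

# Usage (assumes the docker CLI is on PATH; the script path is a placeholder):
#   Start-DockerWatchdog -Probe { docker info *> $null; $LASTEXITCODE -eq 0 } `
#                        -Recover { & 'C:\path\to\Recover-Docker.ps1' }
```

Injecting the probe keeps the loop testable without Docker. In practice the probe should carry its own timeout, since a wedged engine can hang the CLI instead of failing fast.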
This is a tiny example of a bigger principle we build everything around: agents operating real infrastructure need the infrastructure to be self-healing. If standing your environment back up is a 10-step wiki page, an autonomous agent can't recover from a bad state. If it's one command — and the agent knows when to run it — the system keeps itself alive. (More on that philosophy in Build AI-First.)
Steal it
This is exactly the kind of small, sharp, reusable thing we want to give away. If you're on Docker Desktop + WSL2 + GPU, drop Recover-Docker.ps1 next to your project and bind /recover-docker (or just keep a terminal alias). Next time the whale lies to you, it's one command instead of a reboot.
This is the first of a public pack of free, MIT-licensed agent skills, scripts, and automations — recovery routines, deploy helpers, secret-safety scanners — the unglamorous glue that makes agent-run infrastructure actually work.
Get it: github.com/Aitherium/aither-skills — recover-docker (skill + script) is live now. Grab it, fork it, ship it. More landing soon.
— David Parkhurst · Aitherium · aitherium.com