Your House Burns Down. Your AI Remembers Everything.
We built AitherOS to be an AI operating system that learns, remembers, and evolves. But until today, all of that memory lived on a single disk. If the house burned down, Jarvis would wake up on a new machine with amnesia.
That is not acceptable.
The Problem
AitherOS runs 40+ Docker containers orchestrating everything from LLM inference to agent coordination to creative pipelines. The data that matters -- the secrets vault, conversation memory, persona state, PostgreSQL databases, training artifacts, audit logs -- all lives in Docker volumes and bind mounts on one machine.
We had local backups running (AitherRecover bundles every 6 hours, pg_dump every 15 minutes, ACTA snapshots every 10 minutes). But every copy lived on the same disk. A house fire, a theft, a catastrophic disk failure -- and everything is gone.
The goal: automated offsite backup to GitHub, zero manual steps, encrypted, every 6 hours.
What We Built
One Encrypted Tarball, Three API Calls
The original backup design walked 1,600+ files, computed SHA-256 for each, then made 2 GitHub API calls per file (GET to check existence, PUT to upload). That is 3,200+ HTTPS calls. The container would crash before finishing.
The fix is embarrassingly simple: compress everything into a single tarball, encrypt it, upload one file.
tar -czfall critical paths into a single compressed archive (~3 MB)- Fernet-encrypt it with
AITHER_MASTER_KEY(~4 MB) - Upload ONE file to GitHub via Contents API
| Metric | Before | After |
|---|---|---|
| GitHub API calls per backup | 3,200+ | 3 |
| Backup size | N/A (crashed) | 3.8 MB encrypted |
| Time to complete | Never finished | ~10 seconds |
| Files protected | 0 | 635 |
The Three Bugs That Took 4 Hours
Bug 1: Compound service startup events never fire. AitherRecover runs inside SecurityCore as a compound sub-service (FastAPI app.mount). But @app.on_event("startup") handlers do not fire for mounted sub-apps. The auto-backup loop literally never started. We fixed this by explicitly firing sub-app startup handlers in SecurityCore's lifespan context manager.
Bug 2: Token resolution during boot. get_secret("GITHUB_TOKEN") makes an HTTP call to AitherSecrets with a 0.5-second timeout. During startup, services are still booting and the call times out. The fix: check os.environ first (injected by Docker Compose), vault second.
Bug 3: pg_dumpall version mismatch. PostgreSQL 16 was running but the container only had pg client 15. pg_dumpall refuses to work across major versions. We added postgresql-client-16 from the PGDG repo to the Dockerfile.
What Gets Backed Up
The critical paths are surgical -- only tier-1 data that cannot be regenerated:
- Secrets vault (vault.enc, signing keys, CA certificates)
- Identity data (users, roles, agent identities)
- PostgreSQL dumps (tenant databases, RBAC state)
- Persona memory (Spirit, ACTA, conversation history)
- Configuration (personas, RBAC groups/roles)
- Encrypted .env snapshot (all passwords, API keys)
Everything else (model weights, codegraph embeddings, TLS certs, training checkpoints) is either re-downloadable or regenerable.
The Recovery Runbook
If everything is lost, recovery takes about 2 hours:
- Clone AitherOS from GitHub (source code survived)
- Download the encrypted backup tarball from the AitherBackup private repo
- Decrypt with your escrowed master key
- Extract, start services, restore PostgreSQL
- Reissue TLS certificates (one API call)
- Rebuild codegraph index (25 minutes)
Most of the time is Docker image pulls and model downloads from HuggingFace.
The One Thing You Must Do Manually
Escrow your AITHER_MASTER_KEY. It encrypts both the secrets vault and the backup tarball. Without it, the backup is just encrypted noise. Store it in a password manager, a safety deposit box, or somewhere it will survive whatever destroys your hardware. That is the one manual step that cannot be automated.
What is Next
The backup runs every 6 hours automatically. The next iteration will add incremental diffs, WAL streaming for PostgreSQL (5-minute RPO instead of 15-minute), and quarterly automated restore drills on a clean VM.
For now, the AI survives a house fire. That is enough for today.
Built with AitherOS. The backup system that took 4 hours of debugging compound FastAPI startup events, Docker pg_dumpall version mismatches, and GitHub API rate limits to get one 3.8 MB file uploaded.