Our Backups Were Invisible: How We Audited and Fixed AitherOS Disaster Recovery
On March 15, 2026, we ran a full disaster recovery audit on AitherOS. The question was simple: if we lost this machine right now, what would we actually be able to recover?
The answer was alarming. Our backup service had been running continuously, producing backups every six hours, logging success messages -- and writing every single one of them into a Docker named volume that was completely invisible from the host filesystem. For two days, we had zero usable backups of our secrets vault, identity database, RBAC permissions, or any of the nine SQLite databases that store the system's operational state.
The Strata tiered storage backups kept running (they used a bind mount), which created the illusion that everything was fine. The backup dashboard showed recent timestamps. The logs showed no errors. But the critical data -- the stuff you actually need to recover from a disaster -- wasn't being captured.
This is the story of the audit, the root cause, and the four things we built to make sure it can't happen again.
The Architecture
AitherRecover is a 3,500-line Python service that coordinates disaster recovery across multiple storage backends. It runs inside the SecurityCore compound container alongside AitherFlux (the event bus), AitherIdentity (authentication), and AitherInspector (network analysis).
The backup pipeline has three layers:
Layer 1: PostgreSQL dumps. A sidecar container (aither-pgdump) runs pg_dumpall every 15 minutes. AitherRecover also triggers a fresh dump before every full backup. These SQL files are the only portable backup that survives a docker compose down -v. They land in data/backups/postgres/ with automatic rotation (keep last 8).
Layer 2: Local filesystem backup. Every 6 hours, AitherRecover walks a list of 30+ critical paths relative to the repo root:
DEFAULT_BACKUP_PATHS = [
    "AitherOS/Library/Data",            # All SQLite databases
    "AitherOS/Library/Data/secrets",    # vault.enc, .vault_salt, keys/, ca/
    "AitherOS/Library/Data/identity",   # Identity provider data
    "AitherOS/lib/security/data/rbac",  # RBAC JSON (users, groups, roles)
    "AitherOS/data",                    # Memory graphs, knowledge graphs
    "data/backups/postgres",            # PostgreSQL SQL dumps
    # ... 25 more paths
]
For each file, it calculates a SHA256 hash, copies it to the backup directory, and writes a JSON manifest. Incremental mode compares hashes against the previous manifest and skips unchanged files.
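The hash-and-skip logic described above can be sketched roughly as follows. This is a minimal illustration, not AitherRecover's actual code: the function names, the manifest layout (a `files` mapping of relative path to hash and size), and the `manifest.json` filename are all assumptions made for the example.

```python
import hashlib
import json
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def incremental_backup(source_root: Path, backup_dir: Path, previous: dict) -> dict:
    """Copy files whose hash differs from `previous` and return the new manifest.

    `previous` maps relative path -> {"sha256": ..., "size": ...}
    (a hypothetical layout; pass {} for a full backup).
    """
    manifest = {"files": {}, "copied": [], "skipped": []}
    for path in sorted(p for p in source_root.rglob("*") if p.is_file()):
        rel = path.relative_to(source_root).as_posix()
        file_hash = sha256_of(path)
        manifest["files"][rel] = {"sha256": file_hash, "size": path.stat().st_size}
        if previous.get(rel, {}).get("sha256") == file_hash:
            # Unchanged since the previous manifest: record it, do not copy it.
            manifest["skipped"].append(rel)
            continue
        target = backup_dir / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
        manifest["copied"].append(rel)
    backup_dir.mkdir(parents=True, exist_ok=True)
    (backup_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

Note that in this sketch an incremental backup directory holds only the changed files, so a restore would have to walk back through earlier backup sets. Whether AitherRecover does the same is not something the manifest description above settles.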
Layer 3: Strata tiered storage. AitherOS has a hot/warm/cold/lockbox storage tier system. The backup service exports data from each tier into the backup set. This includes ACTA billing database snapshots, agent workspaces, training data, and encrypted lockbox entries.
On top of all this, a DR mirror loop runs every 6 hours, encrypting per-user backup sets with Fernet (AES-128-CBC + HMAC) using user-specific keys stored in the lockbox, and pushing them to private GitHub repos.
What We Found
The Bind Mount Bug
The root cause was a single line in .DEPLOYMENT/compose/docker-compose.aitheros.yml:
# What it said:
- security-core-backups:/data/backups
# What it should have said:
- ./data/backups:/data/backups
The first form is a Docker named volume. Docker creates an isolated filesystem inside the Docker VM, completely separate from the host. The second form is a bind mount -- it maps the host directory directly into the container.
The comment above the line explained why:
# BACKUP PERSISTENCE - using named volume to avoid 9P hang
# on Docker Desktop Windows
Docker Desktop on Windows uses 9P (or virtiofs) to share the host filesystem with Linux containers. Under heavy I/O, this can hang. Someone had switched to a named volume to work around this -- a reasonable decision for performance, but catastrophic for disaster recovery. The entire point of a backup is that it exists somewhere you can reach it when the system that created it is gone.
The fix was trivial:
- ./data/backups:/data/backups
The Invisible Failure
What made this particularly dangerous was the absence of any error signal. The backup loop ran on schedule, wrote files successfully (to the named volume), logged completion messages, and reported healthy status. From inside the container, everything worked perfectly. The backups were real -- they just weren't reachable from the host.
The Strata backup service (aither-strata) happened to use a bind mount for the same directory, so its backup_* directories appeared on the host normally. When you looked at data/backups/, you saw recent timestamps and assumed everything was fine. But if you looked at the manifests, you'd see:
{
  "hostname": "aitheros-strata",
  "components": {
    "lockbox": { "files": 299, "size": 56849726 },
    "secrets": { "files": 0, "size": 0 },
    "training": { "files": 2, "size": 3929 }
  }
}
secrets: { files: 0 }. Every single Strata backup had zero secrets files. That's because Strata doesn't have access to the secrets vault -- that's AitherRecover's job, and AitherRecover's output was trapped in the named volume.
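A check like the one below would have caught this on day one. It is a small sketch that assumes only the manifest shape shown above; the `empty_components` helper is hypothetical and not part of AitherRecover.

```python
import json


def empty_components(manifest: dict) -> list[str]:
    """Return the names of components that captured no files or no bytes."""
    components = manifest.get("components", {})
    return sorted(
        name
        for name, stats in components.items()
        if stats.get("files", 0) == 0 or stats.get("size", 0) == 0
    )


manifest = json.loads("""
{
  "hostname": "aitheros-strata",
  "components": {
    "lockbox": { "files": 299, "size": 56849726 },
    "secrets": { "files": 0, "size": 0 },
    "training": { "files": 2, "size": 3929 }
  }
}
""")
print(empty_components(manifest))  # ['secrets']
```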
The Data at Risk
Here's what had no usable backup for 48+ hours:
| Data Store | Size | Impact if Lost |
|---|---|---|
| vault.enc | 29 KB | Every API key, OAuth token, service credential |
| .vault_salt | 16 bytes | Cannot decrypt vault without this |
| identities.json | 73 KB | Service identity registry (Ed25519 keypairs) |
| keys/ | 16 files | Inter-service signing keys (Ed25519) |
| directory.db | 4.3 MB | All LDAP/DIT entries -- users, agents, tenants, services, certificates |
| rbac.db | 100 KB | All access control -- users, groups, roles, permissions |
| social.db | 88 KB | Social graph, friendships |
| continual_training.db | -- | Training pipeline state, curriculum tracking |
| event_queue.db | -- | ResilientQueue persistence (the irony) |
| 4 more .db files | -- | SMTP, expeditions, model versions, MySpace profiles |
There was also a vault.enc.CORRUPT.20260308_031137 file sitting in the secrets directory -- evidence of a vault corruption incident a week earlier. If the current vault had corrupted during those 48 hours, recovery would have required the rolling backup from February 22nd, three weeks stale.
What We Fixed
1. The Volume Mount
Changed the named volume to a bind mount. Removed the orphaned volume definition. Rebuilt the container.
# Before:
- security-core-backups:/data/backups
# After:
- ./data/backups:/data/backups
We verified the fix by checking that files written inside the container (master_key.enc) appeared on the host immediately after restart.
2. Explicit Critical Path Listing
AitherOS/Library/Data was already in the backup path list, which technically covers all databases via recursive globbing. But "technically covers" isn't good enough for disaster recovery. We added every critical SQLite database explicitly:
# CRITICAL: SQLite databases (explicit for audit trail)
# These are also covered by AitherOS/Library/Data above, but listed
# explicitly so the backup manifest clearly shows each critical DB.
"AitherOS/Library/Data/directory.db",
"AitherOS/Library/Data/rbac.db",
"AitherOS/Library/Data/social.db",
"AitherOS/Library/Data/smtp.db",
"AitherOS/Library/Data/expeditions.db",
"AitherOS/Library/Data/continual_training.db",
"AitherOS/Library/Data/event_queue.db",
"AitherOS/Library/Data/model_versions.db",
"AitherOS/Library/Data/myspace.db",
Now when you read a backup manifest, you can immediately see whether each critical database was captured. No guessing about whether a parent directory glob resolved correctly.
3. Backup Health Monitoring
We added a /health/backup endpoint that answers the question: "Is disaster recovery actually working right now?"
@app.get("/health/backup")
async def backup_health_check():
    report = backup_manager.local_engine.get_backup_health()
    # Also check PostgreSQL dump freshness
    ...
    return report
The health check inspects the latest backup manifest and reports:
- Status: healthy, warning, or critical
- Age: How many hours since the last backup (threshold: configurable, default 12h)
- Critical file coverage: Does the manifest contain vault.enc, .vault_salt, directory.db, rbac.db, and signing keys?
- PostgreSQL dump freshness: Is the latest dump less than 1 hour old?
- Warnings: Plain-English descriptions of every issue found
{
  "status": "healthy",
  "age_hours": 2.3,
  "staleness_threshold_hours": 12,
  "total_backups": 8,
  "critical_files_present": {
    "vault.enc": true,
    ".vault_salt": true,
    "directory.db": true,
    "rbac.db": true,
    "signing_keys": true
  },
  "postgres": {
    "latest_dump": "dump_20260315_232938.sql",
    "age_minutes": 12.4,
    "healthy": true
  },
  "warnings": []
}
This endpoint can be scraped by monitoring systems, checked by the JarvisBrain awareness loop, or queried by agents. The critical insight is: a backup that ran successfully but produced zero files is not a successful backup. The old system couldn't distinguish between "backup completed" and "backup captured the data we need." The new system can.
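To make the "validate the output" idea concrete, here is a rough sketch of how such an evaluation might be structured. It is not the real get_backup_health() implementation, which is not shown here. The function name, the `created_at` field, the `files` mapping, and the severity mapping (no backups is critical, stale or incomplete is a warning) are all assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

# Critical files taken from the list above; signing keys omitted for brevity.
CRITICAL_FILES = ("vault.enc", ".vault_salt", "directory.db", "rbac.db")


def evaluate_backup_health(manifest, now, threshold_hours=12.0):
    """Classify the latest backup as healthy, warning, or critical.

    `manifest` is None when no backup exists. Otherwise it is assumed to hold
    a timezone-aware "created_at" datetime and a "files" mapping keyed by
    relative path (both hypothetical field names).
    """
    if manifest is None:
        return {"status": "critical", "warnings": ["No backups found"]}

    warnings = []
    age_hours = (now - manifest["created_at"]).total_seconds() / 3600
    if age_hours > threshold_hours:
        warnings.append(
            f"Latest backup is {age_hours:.1f}h old (threshold {threshold_hours}h)"
        )

    # Compare on file names so the check does not depend on directory layout.
    names = {path.rsplit("/", 1)[-1] for path in manifest["files"]}
    present = {name: name in names for name in CRITICAL_FILES}
    for name, found in present.items():
        if not found:
            warnings.append(f"Critical file missing from manifest: {name}")

    return {
        "status": "warning" if warnings else "healthy",
        "age_hours": round(age_hours, 1),
        "critical_files_present": present,
        "warnings": warnings,
    }
```

The key design point is that the result depends on the manifest contents, never on whether the backup job reported success.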
4. Automatic Retention Rotation
Without rotation, backups accumulate until they fill the disk. We had 50+ ACTA snapshots from a single day in the cold tier, 10+ Strata backups from a single afternoon, and filesystem backups from early March still sitting around.
The rotation system keeps the most recent N backups (configurable via AITHER_BACKUP_RETENTION_COUNT, default 10) and deletes older ones -- both the data directory and the manifest. It runs automatically after every successful full backup:
# In the auto-backup loop, after successful full_backup():
rot = await backup_manager.local_engine.rotate_old_backups()
if rot.get("pruned", 0) > 0:
    logger.info(f"[AUTO-BACKUP] Retention: pruned {rot['pruned']} old backups")
There's also a manual endpoint (POST /retention/rotate) for ad-hoc cleanup.
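For readers who want to see the shape of a keep-last-N rotation, here is a minimal synchronous sketch. The real rotate_old_backups() is async and its internals are not shown in this post. The directory naming (backup_<timestamp>, sortable by name) and the sibling <name>.json manifest are assumptions made for the example.

```python
import shutil
from pathlib import Path


def rotate_old_backups(backup_root: Path, keep: int = 10) -> dict:
    """Keep the `keep` newest backup_* directories and delete the rest.

    Assumes directory names such as backup_20260315_120000 that sort
    chronologically, each with an optional sibling manifest <name>.json.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    backups = sorted(
        (p for p in backup_root.glob("backup_*") if p.is_dir()),
        key=lambda p: p.name,
        reverse=True,  # newest first
    )
    pruned = []
    for old in backups[keep:]:
        shutil.rmtree(old)  # remove the data directory
        manifest = old.with_name(old.name + ".json")
        if manifest.exists():
            manifest.unlink()  # remove the matching manifest
        pruned.append(old.name)
    return {"kept": len(backups[:keep]), "pruned": len(pruned), "pruned_names": pruned}
```

Refusing `keep < 1` is a deliberate guard in this sketch: a misconfigured retention count should never be able to delete every backup.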
The Emergency Backup
While we were implementing fixes, the system was running without filesystem backups. So we ran an emergency manual backup of all critical files:
292 files, 6.2 MB total
- vault.enc, .vault_salt, keys/, ca/, vault_box/, identities.json
- 9 SQLite databases (directory, RBAC, social, SMTP, ...)
- RBAC JSON files (users.json, groups.json, roles.json)
- Identity provider data
- SHA256 checksum manifest for verification
6.2 MB. That's the total size of everything you need to reconstruct the entire identity, authorization, and secrets infrastructure of a 97-microservice operating system. It would fit on five floppy disks, if you could find them.
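The checksum manifest is only useful if something reads it back. A verification pass might look like the sketch below, which assumes the manifest can be loaded as a plain mapping of relative path to hex SHA-256. That format and the `verify_checksums` name are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def verify_checksums(backup_dir: Path, checksums: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means every file matched."""
    problems = []
    for rel, expected in sorted(checksums.items()):
        path = backup_dir / rel
        if not path.is_file():
            problems.append(f"missing: {rel}")
            continue
        # Files in this backup set are small, so reading each one whole is fine.
        actual = hashlib.sha256(path.read_bytes()).hexdigest()
        if actual != expected:
            problems.append(f"mismatch: {rel}")
    return problems
```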
The Test Suite
AitherRecover had zero tests before this audit. For a 3,500-line service that handles disaster recovery -- arguably the most important service in the entire system -- that's unacceptable.
We wrote 65 tests across 10 test classes, including:
- Backup path coverage (17 tests): Every critical database, the secrets vault, identity data, RBAC files, and postgres dumps are explicitly verified to be in the backup path list.
- Docker compose validation (3 tests): The named volume must not be used. The bind mount must exist. The environment variable must be set.
- Local backup engine (7 tests): Create backups, verify SHA256 integrity, test exclude patterns, list ordering, graceful handling of missing paths.
- Retention rotation (5 tests): Keeps correct count, deletes oldest, removes data directories, handles edge cases.
- Health check (4 tests): Healthy state, missing files warning, critical when no backups, stale backup detection.
- Source structure (14 tests): Endpoint existence, method existence, configurable parameters, auto-rotation wiring.
- Restore round-trip (2 tests): Backup and restore produces byte-identical files with matching SHA256 hashes.
- Edge cases (4 tests): Empty directories, special characters in filenames, unconfigured backup dir.
The Docker compose validation tests are particularly important. They're regression tests that will fail if anyone changes the volume mount back to a named volume -- turning a configuration review into an automated check.
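The core of such a regression check can be quite small. The sketch below is not the actual test code: it scans compose text line by line rather than parsing YAML, handles only short-syntax entries without a mode suffix such as :ro, and uses hypothetical helper names. It relies on the Compose short-syntax rule that a source beginning with `.`, `/`, or `~` is a host path, while any other source is a named volume.

```python
def classify_mount(entry: str) -> str:
    """Classify a short-syntax compose volume entry as 'bind' or 'named'."""
    source = entry.strip().lstrip("-").strip().strip("'\"").split(":", 1)[0]
    return "bind" if source.startswith((".", "/", "~")) else "named"


def backup_mounts(compose_text: str, container_path: str = "/data/backups") -> list[str]:
    """Return the kind of every list entry that mounts onto `container_path`."""
    kinds = []
    for line in compose_text.splitlines():
        stripped = line.strip()
        # Comment lines start with '#', so only real list entries match here.
        if stripped.startswith("-") and stripped.rstrip("'\"").endswith(":" + container_path):
            kinds.append(classify_mount(stripped))
    return kinds
```

A test would then read the compose file and assert that `"named"` does not appear in `backup_mounts(text)` and that `"bind"` does.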
Lessons
1. Backups that you can't reach aren't backups. This seems obvious, but Docker named volumes create a subtle trap. The backup runs, succeeds, and writes files, but those files live in Docker-managed storage that is invisible from the host filesystem. If the container is gone (which is exactly when you need the backup), the data might still exist in the Docker VM's storage, but good luck extracting it under pressure.
2. Validate the output, not the process. Our monitoring checked whether the backup process ran and whether it reported success. It didn't check whether the output contained the data we need. The new health check asks: "Does the latest manifest contain vault.enc?" That's a fundamentally different question from "Did the backup complete?"
3. Explicit is better than implicit. AitherOS/Library/Data covers all the SQLite databases via recursive globbing. But when you're reading a manifest during a 3 AM disaster recovery, you don't want to mentally resolve glob patterns. You want to see directory.db: { sha256: "abc123", size: 4476928 } right there in the manifest. We added every critical file explicitly, even though they were already covered by a parent path.
4. Test your disaster recovery, not just your backups. We had comprehensive backup code, automated scheduling, encryption, and multi-destination support. We had zero tests. The backup system is the last line of defense -- if it has a bug, you find out when you need it most. We now have 65 tests, including regression tests for the exact failure mode that caused this incident.
5. Docker Desktop filesystem sharing is a foot-gun. The 9P/virtiofs bridge between Windows and Linux containers has real performance issues under sustained I/O. Switching to a named volume is not the solution for data that must be accessible from the host. The solution is to accept the performance cost, or to use a different architecture (bind mount to a dedicated fast path, or write to a network-accessible location).
The Recovery Philosophy
AitherRecover was designed around a dead-simple recovery philosophy:
- Clone the repo
- Run bootstrap.ps1
- Point at backup: ./Start-AitherZero.ps1 -Mode Recover -BackupRepo "Aitherium/AitherBackup"
- Everything restores automatically
That philosophy is sound. But it only works if the backup repo actually contains the data. After this audit, we can verify that it does -- automatically, continuously, and with clear alerts when it doesn't.
The backup system that runs silently and successfully is the most dangerous kind. It gives you confidence without evidence. Now we have the evidence.