
Identity Convergence: How We Unified Auth Across a Multi-Service Agent OS

March 8, 2026 · 12 min read · Aitherium

When you build an AI operating system with dozens of microservices, auth becomes the invisible tax on every feature. Every new tool, every agent dispatch, every API call needs to answer the same question: who are you, and what are you allowed to do?

AitherOS had an answer. It just had two of them. And they didn't agree.

The Problem: Two Auth Islands

AitherIdentity handled RBAC — users, roles, groups, permissions, JWT tokens, API keys. It knew who you were and what your role allowed.

AitherACTA handled billing — plans, token balances, usage tracking, Stripe/Patreon/crypto payments. It issued its own API keys with no connection to RBAC roles.

The result was predictable: services authenticated against one or the other, sometimes both, sometimes neither. The MCP server — our Claude Code integration point — would try Identity first, fall through to ACTA, get a billing key with no RBAC roles, and then silently fail with 403 on every admin endpoint.

The key worked. It authenticated. But it had no RBAC roles, so every permission check returned empty — resulting in blanket 403 denials.
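The failure is easy to reproduce in miniature. The sketch below assumes a hypothetical role-to-permission table (`ROLE_PERMISSIONS`) and helper names that do not come from the AitherOS codebase. It only illustrates why a credential that authenticates can still be denied on every route.

```python
# Hypothetical role table; the real AitherIdentity schema is not shown in this post.
ROLE_PERMISSIONS = {
    "admin": {"blog:read", "blog:write", "system:admin"},
    "registered": {"blog:read"},
}


def permissions_for(roles):
    """Union of the permissions granted by each role; no roles means no permissions."""
    granted = set()
    for role in roles:
        granted |= ROLE_PERMISSIONS.get(role, set())
    return granted


def check(roles, required):
    """Return the HTTP status a permission check would produce."""
    return 200 if required in permissions_for(roles) else 403


# A billing key authenticates fine but carries no RBAC roles.
billing_key_roles = []
print(check(billing_key_roles, "blog:write"))  # 403
print(check(["admin"], "blog:write"))          # 200
```

Authentication succeeded in both calls; only the role list differs, which is why the bug surfaced as a 403 rather than a 401.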

The Fix: One Auth Chain

The core principle: every identity in AitherOS resolves through AitherIdentity. ACTA handles billing, but authentication and authorization flow through one system.

ACTA Registration Creates Identity Users

When someone registers through ACTA (for billing), we now ensure they exist in Identity first:

@app.post("/v1/auth/register")
async def register(req: RegisterRequest):
    # ... resolve user_id ...

    # Identity linkage: ensure user exists in AitherIdentity
    async with httpx.AsyncClient(timeout=5) as client:
        check = await client.get(f"{identity_url}/users/{user_id}", ...)
        if check.status_code == 404:
            await client.post(f"{identity_url}/users", json={
                "username": user_id,
                "roles": [PLAN_ROLE_MAP.get("explorer", "registered")],
                "user_type": "human",
            }, ...)

    # ... create ACTA balance and API key ...

When their plan changes (upgrade, Stripe webhook, Patreon tier grant), ACTA was already syncing the RBAC role — but now there's always an Identity user to sync to.

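async def sync_rbac_role(user_id: str, plan: str, reason: str):
    target_role = PLAN_ROLE_MAP.get(plan)  # explorer→registered, starter→starter_role, etc.
    if target_role is None:
        return  # unknown plan: leave existing roles untouched instead of sending [None]
    async with httpx.AsyncClient(timeout=5) as client:
        await client.put(f"{identity_url}/internal/users/{user_id}/roles",
            json={"roles": [target_role], "reason": f"acta:{reason}"})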
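The ordering guarantee (create the Identity user first, then sync roles) can be sketched without any network calls. `FakeIdentity` is a hypothetical in-memory stand-in for AitherIdentity, and only the two plan mappings named in the post are included. Treat it as an illustration of the invariant, not as the actual service code.

```python
PLAN_ROLE_MAP = {"explorer": "registered", "starter": "starter_role"}


class FakeIdentity:
    """In-memory stand-in for AitherIdentity (hypothetical, for illustration only)."""

    def __init__(self):
        self.users = {}

    def ensure_user(self, user_id):
        # Idempotent: create only when missing, mirroring the 404 check at registration.
        if user_id not in self.users:
            self.users[user_id] = {
                "roles": [PLAN_ROLE_MAP["explorer"]],
                "user_type": "human",
            }
        return self.users[user_id]

    def set_roles(self, user_id, roles, reason):
        if user_id not in self.users:
            raise KeyError(f"no Identity user {user_id!r} to sync to")
        self.users[user_id]["roles"] = list(roles)
        self.users[user_id]["last_reason"] = reason


def sync_rbac_role(identity, user_id, plan, reason):
    """Apply the role for a plan; return False when the plan has no mapping."""
    role = PLAN_ROLE_MAP.get(plan)
    if role is None:
        return False
    identity.set_roles(user_id, [role], f"acta:{reason}")
    return True


identity = FakeIdentity()
identity.ensure_user("user-1")                      # registration step
sync_rbac_role(identity, "user-1", "starter", "stripe_webhook")
print(identity.users["user-1"]["roles"])            # ['starter_role']
```

Without the `ensure_user` call, `set_roles` raises, which is the in-memory analogue of a role sync that has no Identity user to target.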

MCP Service Auth: Register Once, Retry on 403

The MCP server authenticates as a service user in Identity with admin role:

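def _try_identity_register() -> Optional[str]:
    try:
        resp = requests.post(f"{identity_url}/auth/service-register", json={
            "service_name": "mcp-server",
            "shared_secret": internal_secret,
            "roles": ["admin"],
        }, timeout=5)
        resp.raise_for_status()
        return resp.json().get("api_key")
    except (requests.RequestException, ValueError):
        return None  # Identity unreachable, registration rejected, or non-JSON body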

But keys go stale — containers rebuild, databases re-seed, keys expire. The old behavior: fail silently, never recover. The new behavior: automatic 403 retry with token refresh.

def _authed_request(method, url, **kwargs):
    resp = requests.request(method, url, headers=_write_headers(), ...)
    if resp.status_code == 403:
        invalidate_token()  # Clear stale key from memory + disk cache
        initialize()        # Re-register with Identity
        resp = requests.request(method, url, headers=_write_headers(), ...)
    return resp

Every MCP write operation (blog create, blog delete, blog publish) now uses _authed_request. The first 403 triggers one re-registration and a single retry, so a stale key no longer leaves the server stuck.
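The retry pattern can be exercised in isolation with injected fakes. In this sketch `SelfHealingClient`, `register`, and `send` are hypothetical names chosen for illustration; the real MCP code uses `requests` and a disk cache, both omitted here.

```python
from typing import Callable, Optional


class SelfHealingClient:
    """Retries once after a 403 by discarding the cached key and re-registering."""

    def __init__(
        self,
        register: Callable[[], str],
        send: Callable[[str, str, str], int],
        cached_key: Optional[str] = None,
    ):
        self._register = register  # returns a fresh API key
        self._send = send          # (method, url, api_key) -> HTTP status code
        self._key = cached_key
        self.registrations = 0

    def _ensure_key(self) -> str:
        if self._key is None:
            self._key = self._register()
            self.registrations += 1
        return self._key

    def request(self, method: str, url: str) -> int:
        status = self._send(method, url, self._ensure_key())
        if status == 403:
            self._key = None  # invalidate the stale key
            status = self._send(method, url, self._ensure_key())  # single retry
        return status


# Simulate a key that went stale after a container rebuild.
client = SelfHealingClient(
    register=lambda: "fresh",
    send=lambda method, url, key: 200 if key == "fresh" else 403,
    cached_key="stale",
)
print(client.request("POST", "/blog"))  # 200
```

The retry is deliberately limited to one attempt. A 403 that reflects a genuine permission denial comes back as 403 after a single re-registration instead of looping.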

Personal Access Tokens: Self-Service Scoped Keys

Service keys are for machines. Humans need something they can create, scope, and revoke themselves. We added Personal Access Tokens (PATs) to the Identity RBAC layer.

How They Work

PATs are scoped subsets of a user's permissions. An admin can create a blog:write token that can only write blog posts — nothing else. The token is stored as a cryptographic hash — never as plaintext.

The creation process validates that the requested scopes are a subset of the user's actual permissions. You cannot create a PAT that grants more access than you already have. Each token gets a configurable expiry (default 90 days).
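A minimal sketch of those creation rules follows. The function name `create_pat`, the record layout, and the choice of SHA-256 are assumptions (the post only says "cryptographic hash"). The `aither_pat_` prefix and the 1 to 365 day range come from elsewhere in this post.

```python
import hashlib
import secrets
from datetime import datetime, timedelta, timezone


def create_pat(user_permissions: set, requested_scopes: set, expires_in_days: int = 90):
    """Return (plaintext_token, stored_record); only the record would be persisted."""
    extra = requested_scopes - user_permissions
    if extra:
        # A PAT may never grant more than its owner already holds.
        raise PermissionError(f"scopes exceed user permissions: {sorted(extra)}")
    if not 1 <= expires_in_days <= 365:
        raise ValueError("expiry must be between 1 and 365 days")

    token = "aither_pat_" + secrets.token_urlsafe(32)
    record = {
        # SHA-256 is an assumption; the plaintext token is never stored.
        "token_hash": hashlib.sha256(token.encode()).hexdigest(),
        "scopes": sorted(requested_scopes),
        "expires_at": datetime.now(timezone.utc) + timedelta(days=expires_in_days),
    }
    return token, record


token, record = create_pat({"blog:read", "blog:write"}, {"blog:write"})
print(record["scopes"])  # ['blog:write']
```

Because the subset check uses set difference, the error message can name exactly which requested scopes were out of bounds.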

PATs authenticate through the same get_current_user() dependency as every other auth method:

1. Try Bearer JWT → fail (PAT isn't a JWT)
2. Try X-API-Key via verify_api_key() → fail (PAT hash not in api_key_hash field)
3. Try PAT via verify_personal_token() → match! Return user + attach scopes
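The fall-through order above can be sketched as a list of verifiers tried in sequence. The in-memory hash tables, the demo credentials, and the stubbed `verify_jwt` are all hypothetical. The real `get_current_user()` reads the Bearer and X-API-Key headers separately, which this sketch collapses into a single credential string.

```python
import hashlib
from typing import Callable, List, Optional, Tuple


def sha256_hex(value: str) -> str:
    return hashlib.sha256(value.encode()).hexdigest()


# Hypothetical in-memory stores; only hashes are kept, never plaintext credentials.
API_KEY_HASHES = {sha256_hex("svc_key_demo"): {"user": "mcp-server"}}
PAT_HASHES = {sha256_hex("aither_pat_demo"): {"user": "alice", "scopes": ["blog:write"]}}


def verify_jwt(credential: str) -> Optional[dict]:
    # Stand-in only: real JWT validation checks signature, issuer, and expiry.
    return None


def verify_api_key(credential: str) -> Optional[dict]:
    return API_KEY_HASHES.get(sha256_hex(credential))


def verify_personal_token(credential: str) -> Optional[dict]:
    return PAT_HASHES.get(sha256_hex(credential))


VERIFIERS: List[Tuple[str, Callable[[str], Optional[dict]]]] = [
    ("jwt", verify_jwt),
    ("api_key", verify_api_key),
    ("pat", verify_personal_token),
]


def resolve_user(credential: str) -> Optional[Tuple[str, dict]]:
    """Try each verifier in order and return the first match with its method name."""
    for method, verify in VERIFIERS:
        user = verify(credential)
        if user is not None:
            return method, user
    return None


print(resolve_user("aither_pat_demo"))
# ('pat', {'user': 'alice', 'scopes': ['blog:write']})
```

A credential that matches no verifier yields `None`, which the calling layer would translate into an authentication failure.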

The Veil Self-Service UI

The /settings/api-keys page now has two sections:

Personal Access Tokens — Create scoped tokens with a checkbox grid of permissions (blog read/write, agent execute, system admin, chat, memory, files). Set expiry from 1-365 days. Revoke anytime. Works through the Identity /auth/me/tokens endpoints.

Agent Keys — The existing clearance-based system for agent identities (OBSERVER, CONTRIBUTOR, OPERATOR, ADMIN).

The key UX decision: tokens are shown exactly once at creation. After that, only the hash exists. Copy it or lose it.

Killing Hardcoded Secrets

While tracing the auth chain, we found the admin bootstrap script with a hardcoded password baked into every deployment. This was the only script needed to bootstrap a fresh instance, and it meant every deployment started with the same known credential.

We replaced it with a 4-tier resolution system, tried in order:

1. Secrets vault: encrypted at rest, access-controlled
2. Environment variable: set by operator or container configuration
3. Persistent cache: written on first boot, read on subsequent boots
4. Auto-generate: 24-character URL-safe random, printed once to console

No more hardcoded passwords. Sovereign instances auto-generate on first boot and persist through the secrets vault.
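The four tiers can be sketched as a single fall-through function. Everything named here is an assumption for illustration: `resolve_admin_password`, the `vault_lookup` callable, the environment variable name, and the cache file location. The real bootstrap script also persists through the secrets vault, which this sketch replaces with a local file.

```python
import os
import secrets
from pathlib import Path
from typing import Callable, Optional, Tuple


def resolve_admin_password(
    vault_lookup: Callable[[], Optional[str]],
    env_var: str,
    cache_file: Path,
) -> Tuple[str, str]:
    """Return (password, source), trying vault, env, cache, then generation."""
    value = vault_lookup()
    if value:
        return value, "vault"

    value = os.environ.get(env_var)
    if value:
        return value, "env"

    if cache_file.exists():
        value = cache_file.read_text().strip()
        if value:
            return value, "cache"

    # 18 random bytes encode to exactly 24 URL-safe base64 characters.
    value = secrets.token_urlsafe(18)
    cache_file.write_text(value)
    cache_file.chmod(0o600)  # owner read/write only
    print(f"Generated admin password (shown once): {value}")
    return value, "generated"
```

On a fresh instance with no vault entry and no environment variable, the first call generates and caches a password, and every later call returns the cached value with source `"cache"`.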

Auto-Rotating Service Keys

AitherSecrets already had a manual key rotation endpoint (POST /rotate/{service_name}). What it didn't have was a scheduler. Service keys would live forever unless someone remembered to rotate them.

We added a background loop to the Secrets service lifespan:

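from datetime import datetime, timezone

async def _key_rotation_loop():
    rotation_days = int(os.getenv("AITHER_KEY_ROTATION_DAYS", "90"))
    while True:
        now = datetime.now(timezone.utc)  # assumes created_at is timezone-aware UTC
        # Snapshot the items so a rotation cannot mutate the dict mid-iteration
        for svc_name, identity in list(mgr._identities.items()):
            age_days = (now - identity.created_at).days
            if age_days >= rotation_days:
                await mgr.rotate_service_key(svc_name, sync_to_github=False)
        await asyncio.sleep(6 * 3600)  # Check every 6 hours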

Every service signing key older than the configured threshold (90 days by default) is rotated automatically, at most six hours after it crosses that age. Old keys are backed up. No human intervention needed.
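The age check itself is small enough to test on its own. The helper below is a hypothetical extraction (`keys_due_for_rotation` is not a function from the Secrets service), and the service names in the example are placeholders. It shows how whole-day truncation treats keys near the threshold.

```python
from datetime import datetime, timedelta, timezone


def keys_due_for_rotation(created_at_by_service: dict, now: datetime, rotation_days: int = 90) -> list:
    """Names of services whose key age in whole days meets or exceeds the threshold."""
    return sorted(
        name
        for name, created_at in created_at_by_service.items()
        if (now - created_at).days >= rotation_days
    )


now = datetime(2026, 3, 8, tzinfo=timezone.utc)
created = {
    "identity": now - timedelta(days=90),           # exactly at the threshold: due
    "acta": now - timedelta(days=89, hours=23),     # one hour short: not due yet
    "secrets": now - timedelta(days=200),           # long overdue
}
print(keys_due_for_rotation(created, now))  # ['identity', 'secrets']
```

Because `timedelta.days` truncates, a key that is 89 days and 23 hours old is picked up on a later pass of the loop rather than this one.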

The Full Auth Chain Today

Here's how a request flows through the unified system:

Browser user (SSO):

Cloudflare Zero Trust → SAML → AitherIdentity IdP → JWT → Veil middleware
→ X-User-Roles header → route handler → authorized

API user (PAT):

X-API-Key: aither_pat_xxx → Veil middleware → Identity /auth/me
→ verify_personal_token() → user + scopes → authorized

MCP tool (service key):

mcp_auth.get_auth_headers() → cached Identity API key
→ X-API-Key header → Veil blog route → Identity validation → authorized
→ on 403: invalidate → re-register → retry → authorized

ACTA billing user:

Register → ACTA creates balance + API key → Identity user created
→ Pay → plan_change → sync_rbac_role() → Identity role updated
→ All permissions flow from one RBAC source

What We Learned

Auth is a graph, not a list. The moment you have two services issuing tokens, you have a consistency problem. ACTA and Identity each worked perfectly in isolation. The failure was in the gap between them — a user could authenticate with a billing key that had no RBAC roles.

403 is not a terminal state. Services restart, containers rebuild, databases re-seed. A request that returns 403 today might succeed tomorrow once the key is refreshed. The self-healing retry pattern (invalidate → re-register → retry) turns transient auth failures into transparent recovery.

Scoped tokens are not optional. The choice isn't between "full access" and "no access." PATs let users create exactly the credential they need — a CI key that can only deploy, a monitoring token that can only read, a blog key that can only publish. Least privilege becomes self-service.

Secrets in source code are technical debt with compound interest. That hardcoded admin password existed because it was expedient. It would have been expedient in every deployment, on every fork, in every security audit failure. The 4-tier resolution adds complexity at one point to eliminate risk everywhere.

The auth system now has one source of truth. AitherIdentity decides who you are and what you can do. Everything else — billing, MCP tools, the blog, agent dispatch — asks Identity and trusts the answer.

The changes described in this post span 6 files across 3 services: AitherIdentity (RBAC + PATs), AitherACTA (Identity linkage), AitherSecrets (key rotation), MCP auth (self-healing), MCP blog tools (403 retry), and the Veil API keys page (self-service UI). All endpoints are tested end-to-end: service registration, PAT creation, PAT authentication, blog write through PAT, credential vault CRUD, and MCP tool auth recovery.
