engineering · security · architecture · identity

AitherDirectory: One Tree to Rule Them All — Building a Unified LDAP Directory for an Agent OS

March 9, 2026 · 14 min read · David Parkhurst

AitherOS has 97 microservices, 29 agent personas, multi-tenant RBAC, certificate management, SMTP routing, and a device mesh. Every one of those systems needed to answer the same question: who is this, what can they do, and where do they belong?

The answers were scattered across seven different data stores. Users lived in JSON seed files (promoted to SQLite, optionally to PostgreSQL). Agents lived in 16 YAML identity files. Tenants lived in the tenant service's in-memory state. Services lived in the central service registry. Certificates lived in the certificate service's internal store. SMTP config lived in the mail service's local state. A2A agent cards lived in their own configuration file.

If you wanted to answer "show me everything about agent Athena" — her identity, her capabilities, her A2A card, her email address, her role assignments, her certificate bindings — you had to query five different systems.

We needed a directory.

Why Not Just Use LDAP?

The instinct is obvious: spin up OpenLDAP, define a schema, point everything at it. But AitherOS has some constraints that make off-the-shelf LDAP painful:

1. We're fully containerized. OpenLDAP in a container is fine until you need to coordinate its startup with 23 compound services that need identity resolution during boot. Our infrastructure layer boots first — logging, secrets, networking, telemetry. The directory needs to be there before anything else starts asking "who am I?"

2. Our schema is weird. Standard LDAP object classes (inetOrgPerson, groupOfNames) don't have attributes for aitherEffortBudget, aitherA2ACapabilities, or aitherStorageRole. We'd need a custom schema overlay — which means maintaining .schema files, dealing with OID registration, and fighting slapd every time we add an attribute.

3. We need writes to be fast and local. RBAC checks happen on every API call. If those checks need to round-trip to an LDAP server, we've just added latency to every request in the system. We need in-process reads with sub-millisecond lookup.

4. External LDAP sync is a "nice to have," not a primary use case. Enterprise customers might want to sync from Active Directory. But our primary directory is AitherOS itself — the agents, services, and tenants that make up the mesh.

The answer: build our own directory store with an LDAP protocol adapter bolted on top.

The Directory Information Tree

Every identity object in AitherOS now lives in a single tree rooted at dc=aither,dc=os:

dc=aither,dc=os
├── ou=users
│   ├── uid=admin
│   ├── uid=demo
│   └── uid=athena (aitherUserType=agent)
├── ou=groups
│   ├── cn=administrators
│   ├── cn=security-team (distribution list)
│   └── cn=platform-agents
├── ou=roles
│   ├── cn=owner
│   ├── cn=admin
│   └── cn=starter
├── ou=tenants
│   ├── o=platform
│   │   ├── ou=users
│   │   └── ou=groups
│   └── o=public
│       ├── ou=users
│       └── ou=groups
├── ou=services
│   ├── cn=Genesis (orchestrator)
│   ├── cn=MicroScheduler (LLM scheduling)
│   ├── cn=Strata (storageRole=virtual_storage_system)
│   └── cn=smtp (provider=resend, relay=smtp.resend.com)
└── ou=devices
    ├── cn=gpu-node-1 (meshId=..., gpuClass=A100)
    └── cn=edge-sensor-3 (deviceType=iot)

Six object classes cover everything:

| Object Class | What It Represents | Key Attributes |
| --- | --- | --- |
| aitherUser | Human users AND agent identities | uid, roles, groups, userType, a2a metadata |
| aitherGroup | RBAC groups and email distribution lists | members, roles, groupType |
| aitherRole | Permission bundles | permissions (resource:action:scope), inherits |
| aitherTenant | Isolated tenant contexts | planTier, quotas, slug |
| aitherService | Microservice registrations | service metadata, cert bindings |
| aitherDevice | Mesh nodes, GPUs, IoT | deviceType, meshId, gpuClass |

The DN (Distinguished Name) for every object follows LDAP conventions. A user admin in the platform tenant has DN uid=admin,ou=users,dc=aither,dc=os. A user bob in customer tenant acme has DN uid=bob,ou=users,o=acme,ou=tenants,dc=aither,dc=os. Standard LDAP base/onelevel/subtree scoping just works.
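To make the scoping concrete, here's a minimal sketch (not the production code, and ignoring escaped commas in RDNs) of how base/onelevel/subtree resolution can work by comparing DN components:

```python
def dn_in_scope(entry_dn: str, base_dn: str, scope: str) -> bool:
    """Check whether entry_dn falls within base_dn for an LDAP search scope.

    A DN is a comma-separated list of RDNs, most-specific first, so an
    entry sits "under" a base when its trailing components match the base.
    """
    entry = [c.strip().lower() for c in entry_dn.split(",")]
    base = [c.strip().lower() for c in base_dn.split(",")]
    is_base = entry == base
    is_under = len(entry) > len(base) and entry[-len(base):] == base
    if scope == "base":
        return is_base
    if scope == "onelevel":
        return is_under and len(entry) == len(base) + 1
    return is_base or is_under  # subtree

# Tenant-scoped search: only acme's subtree matches acme's base DN
print(dn_in_scope("uid=bob,ou=users,o=acme,ou=tenants,dc=aither,dc=os",
                  "o=acme,ou=tenants,dc=aither,dc=os", "subtree"))  # True
print(dn_in_scope("uid=admin,ou=users,dc=aither,dc=os",
                  "o=acme,ou=tenants,dc=aither,dc=os", "subtree"))  # False
```

Searching a tenant's subtree can't leak another tenant's users, because their DNs simply don't share the tenant's suffix.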

SQLite WAL: The Unfashionable Choice That Works

The directory store is a single SQLite database in WAL (Write-Ahead Logging) mode. Two tables:

CREATE TABLE entries (
    dn TEXT PRIMARY KEY,
    object_classes TEXT NOT NULL,  -- JSON array
    attributes TEXT NOT NULL,      -- JSON dict {key: [values]}
    created_at TEXT,
    updated_at TEXT
);

CREATE TABLE changelog (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    dn TEXT NOT NULL,
    operation TEXT NOT NULL,       -- add/update/delete
    actor TEXT,
    timestamp TEXT,
    changes TEXT                   -- JSON diff
);

Why SQLite and not the PostgreSQL we already have for RBAC?

Boot order. PostgreSQL is an external dependency that might not be running when infrastructure services start. SQLite is in-process, zero-configuration, and available at import time. The directory needs to be the first thing alive.

Concurrency model. WAL mode gives us concurrent reads with a single writer — which is exactly our access pattern. Hundreds of RBAC checks per second reading, with occasional writes when agents register or users update their profiles.

Portability. A single .db file that can be backed up, replicated, or inspected with standard SQLite tools. No connection pooling, no TCP overhead, no auth configuration.

The changelog table is the other half of the story. Every write — add, update, delete — gets a changelog entry with the actor, timestamp, and JSON diff. This changelog feeds directly into Strata (our virtual storage and telemetry system) for audit trails and training data harvesting.
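As a sketch of the write path (function and variable names here are illustrative, not the actual service code), the entry upsert and the changelog append commit in a single transaction, so the audit trail can never drift from the data:

```python
import json
import os
import sqlite3
import tempfile
from datetime import datetime, timezone

db_path = os.path.join(tempfile.mkdtemp(), "directory.db")
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")  # concurrent readers, single writer
conn.executescript("""
CREATE TABLE IF NOT EXISTS entries (
    dn TEXT PRIMARY KEY,
    object_classes TEXT NOT NULL,
    attributes TEXT NOT NULL,
    created_at TEXT,
    updated_at TEXT
);
CREATE TABLE IF NOT EXISTS changelog (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    dn TEXT NOT NULL,
    operation TEXT NOT NULL,
    actor TEXT,
    timestamp TEXT,
    changes TEXT
);
""")

def upsert_entry(dn, object_classes, attributes, actor):
    """Hypothetical upsert: entry row and changelog row commit together."""
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # one transaction wraps both statements
        conn.execute(
            "INSERT INTO entries (dn, object_classes, attributes, created_at, updated_at) "
            "VALUES (?, ?, ?, ?, ?) "
            "ON CONFLICT(dn) DO UPDATE SET "
            "attributes = excluded.attributes, updated_at = excluded.updated_at",
            (dn, json.dumps(object_classes), json.dumps(attributes), now, now),
        )
        conn.execute(
            "INSERT INTO changelog (dn, operation, actor, timestamp, changes) "
            "VALUES (?, ?, ?, ?, ?)",
            (dn, "add", actor, now, json.dumps(attributes)),
        )

upsert_entry("uid=athena,ou=users,dc=aither,dc=os",
             ["aitherUser"], {"uid": ["athena"], "aitherUserType": ["agent"]},
             actor="agent-sync")
```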

Schema Conversion: The _enum_val Trap

Converting between Python dataclasses and directory entries sounds trivial. It almost is — until you hit Python's enum edge cases.

Our UserType is defined as:

class UserType(str, Enum):
    HUMAN = "human"
    AGENT = "agent"
    SERVICE = "service"

The (str, Enum) base makes isinstance(user_type, str) return True. So when our conversion code did:

attrs["aitherUserType"] = [user_type if isinstance(user_type, str) else str(user_type)]

It passed through UserType.HUMAN as the string "UserType.HUMAN" instead of "human". The isinstance check was true (it's a str subclass), but the string representation includes the class name.
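The quirk is easy to demonstrate in isolation:

```python
from enum import Enum

class UserType(str, Enum):
    HUMAN = "human"
    AGENT = "agent"

ut = UserType.HUMAN
print(isinstance(ut, str))  # True: the str mixin makes members real strings
print(ut.value)             # 'human'
print(str(ut))              # 'UserType.HUMAN': Enum.__str__ wins over str's
```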

The fix was a dedicated helper:

def _enum_val(v) -> str:
    """Unwrap Enum members to their raw value.

    str-mixin Enums pass isinstance(v, str), but str(v) yields
    "UserType.HUMAN", not "human". Do not delete as unused.
    """
    return v.value if hasattr(v, "value") else str(v)

A linter helpfully removed this "unused" function. Tests caught it. Twice. The lesson: if you have a helper that exists to work around a language quirk, give it a docstring that explains the quirk, or your tooling will delete it.

The LDAP Protocol Server: 300 Lines of BER

LDAP is a binary protocol. Messages are encoded in ASN.1 BER (Basic Encoding Rules), a TLV (Tag-Length-Value) format from the 1980s that predates HTTP, JSON, and most people reading this.

We implemented a pure-Python BER encoder/decoder and a subset of LDAPv3 (RFC 4511) in about 300 lines. The server runs as an asyncio TCP server and handles three operations:

BIND — Simple authentication. Anonymous binds are read-only. Named binds look up the user in the directory, verify the password hash with bcrypt, and grant authenticated access. No SASL, no Kerberos — simple auth only for now.

SEARCH — The real workhorse. Takes a base DN, scope (base/onelevel/subtree), and filter (equality, presence, substring). Translates directly to the directory store's search methods. Returns standard LDAP SearchResultEntry messages.

UNBIND — Close the connection.

Write operations (ADD, MODIFY, DELETE) return UNWILLING_TO_PERFORM. The directory is authoritative through its REST API; LDAP is a read-only query interface. This is intentional — if Grafana, Portainer, or an external monitoring tool wants to query the service registry, they can point an LDAP client at the directory and get results. But they can't modify the tree.

Why not use ldaptor or python-ldap? Dependency weight. ldaptor pulls in Twisted. python-ldap requires OpenLDAP C libraries. Our entire LDAP server is a single Python file with zero external dependencies. It handles the 95% case (directory browsing, service discovery, user lookups) and explicitly rejects the edge cases (referrals, extended operations, paged results) that would triple the code size.
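For a flavor of what those 300 lines deal with, here's a minimal sketch of BER's TLV length encoding (function names are illustrative, not our actual API):

```python
def ber_length(n: int) -> bytes:
    """Encode a BER length as octets.

    Short form: lengths under 128 fit in one byte.
    Long form: 0x80 | count-of-length-bytes, then the length big-endian.
    """
    if n < 0x80:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body

def ber_octet_string(data: bytes) -> bytes:
    """Tag 0x04 (OCTET STRING) + length + value: the TLV pattern."""
    return b"\x04" + ber_length(len(data)) + data

print(ber_length(15).hex())    # '0f'     (short form)
print(ber_length(300).hex())   # '82012c' (long form: two length bytes)
print(ber_octet_string(b"dc=aither,dc=os").hex()[:4])  # '040f'
```

Every LDAP message is nested TLVs all the way down; decoding is the same logic run in reverse, plus bounds checking on untrusted input.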

Three Sync Engines

The directory doesn't generate data — it aggregates it from authoritative sources.

Agent Sync

Reads from four sources:

  • Agent identity files — 16 YAML files defining name, role, capabilities, and tool profiles
  • A2A protocol cards — Agent skills, endpoints, and authentication metadata
  • Agent email addresses — Per-agent email addresses for inter-agent communication
  • Dynamic registrations — Agents that register at runtime

All four sources merge into unified user entries with an agent type flag. The A2A card data, tool profiles, and email addresses go into metadata as a JSON blob. The identity files remain the authoritative source — the directory is a runtime cache that enables cross-service queries.
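A sketch of that merge, with source shapes and field names hypothetical: each optional source overlays data onto the base entry built from the identity file, which stays authoritative for core fields.

```python
import json

def build_agent_entry(identity, a2a_card=None, email=None):
    """Merge agent sources into one directory entry (illustrative shapes)."""
    entry = {
        "uid": [identity["name"].lower()],
        "cn": [identity["name"]],
        "aitherUserType": ["agent"],
        "aitherCapabilities": identity.get("capabilities", []),
    }
    meta = {}
    if a2a_card:
        meta["a2a"] = a2a_card  # skills, endpoints, auth metadata
    if email:
        entry["mail"] = [email]
    if meta:
        entry["aitherMetadata"] = [json.dumps(meta)]
    return entry

entry = build_agent_entry(
    {"name": "Athena", "capabilities": ["research", "synthesis"]},
    a2a_card={"skills": ["web-search"]},
    email="athena@example.invalid",  # placeholder address
)
print(entry["uid"], entry["aitherUserType"])  # ['athena'] ['agent']
```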

Service Sync

Reads the central service registry (the single source of truth for all 97 services) and creates service entries. Each entry captures the service's metadata, dependencies, features, and compound/absorption relationships.

Certificate bindings from the certificate service get synced as attribute overlays (CA ID, serial, expiration, fingerprint). SMTP config from the mail service gets its own entry. Email distribution lists become group entries with a distribution list type.

And Strata — our virtual storage and telemetry lake — gets properly classified with its storage role, type, and capabilities (event ingestion, session logging, audit trail, training data export, metrics aggregation).

Tenant Sync

Creates entries for the two default tenants (platform and public) and any registered customer tenants. Each tenant gets sub-OUs (ou=users, ou=groups) so tenant-scoped objects live in their own subtree. This means you can search a tenant's user subtree to find only that tenant's users — standard LDAP scoping.

The RBAC Integration: Four-Tier Backend Chain

The most delicate part was integrating with the existing RBAC system. AitherOS already had a three-tier storage chain:

PostgreSQL → SQLite → JSON seed files

The directory becomes tier zero:

Directory → PostgreSQL → SQLite → JSON seed files

During RBAC data loading, we now try the directory backend first. If the directory has users, groups, and roles, we use it. If it's empty (first boot), we fall through to the next tier and then seed the directory from whatever backend had the data.
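The chain logic amounts to something like this (a sketch against a hypothetical backend interface, not the real one):

```python
def load_rbac_data(backends):
    """Try each tier in priority order; seed earlier empty tiers from the first hit."""
    for i, backend in enumerate(backends):
        data = backend.load()
        if data:
            # First boot: the directory (tier zero) is empty, so seed it
            # from whichever lower tier actually had the data.
            for empty_tier in backends[:i]:
                empty_tier.save(data)
            return data
    return None

class FakeBackend:
    """Stand-in for a storage tier, for illustration only."""
    def __init__(self, data=None):
        self.data = data
    def load(self):
        return self.data
    def save(self, data):
        self.data = data

directory, sqlite_tier = FakeBackend(), FakeBackend({"users": ["admin"]})
data = load_rbac_data([directory, sqlite_tier])
print(directory.data)  # the empty directory got seeded: {'users': ['admin']}
```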

A critical guard ensures the directory backend is skipped during testing, so tests that create isolated databases don't accidentally find the production directory and load real data from it.

External LDAP Client: Enterprise Sync

For enterprise deployments where an existing Active Directory or OpenLDAP server is the authoritative user store, we have an outbound sync client. Configuration covers the LDAP server address, bind credentials, user and group search bases, search filters, and group-to-role mappings (e.g., "Domain Admins" maps to "owner", "IT-Security" maps to "security").

The external LDAP client uses the ldap3 library (optional dependency) to periodically sync users and groups from the external directory. It also supports pass-through credential verification — when a user authenticates, AitherIdentity can verify their credentials against the external LDAP server directly.

Attribute mapping is configurable. AD uses sAMAccountName where we use uid. AD uses memberOf where we use aitherGroups. The client handles the translation.
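A sketch of the translation, using the mappings mentioned above (the default tables here are illustrative; the real ones are configurable):

```python
# Attribute and group-to-role mappings; values shown are examples only.
ATTR_MAP = {"sAMAccountName": "uid", "memberOf": "aitherGroups", "mail": "mail"}
GROUP_ROLE_MAP = {"Domain Admins": "owner", "IT-Security": "security"}

def translate_ad_user(ad_attrs):
    """Rename AD attributes and map AD groups onto AitherOS roles."""
    entry = {local: ad_attrs[remote]
             for remote, local in ATTR_MAP.items() if remote in ad_attrs}
    entry["aitherRoles"] = sorted(
        GROUP_ROLE_MAP[g] for g in ad_attrs.get("memberOf", [])
        if g in GROUP_ROLE_MAP)
    return entry

user = translate_ad_user({
    "sAMAccountName": "jdoe",
    "memberOf": ["Domain Admins", "Staff"],
})
print(user["uid"], user["aitherRoles"])  # jdoe ['owner']
```

Unmapped groups (like "Staff" above) pass through as plain group memberships without granting any role.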

The Service

AitherDirectory runs as a FastAPI service in the infrastructure layer. It's one of the first services to boot, right alongside logging, secrets, and telemetry.

25+ REST endpoints cover:

  • Entry CRUD: /entry (GET/PUT/DELETE), /entries/search (POST)
  • Typed lookups: /users, /users/{uid}, /agents, /groups/{cn}, /roles/{cn}, /tenants/{id}, /services, /services/{cn}, /devices
  • Sync operations: /sync/agents, /sync/services, /sync/tenants, /sync/certificates, /sync/smtp, /sync/all
  • Diagnostics: /changelog, /stats, /tree

At startup, the service runs an initial sync: tenants first (so the OU structure exists), then agents, then services. The LDAP server starts alongside the REST API. And every changelog entry gets forwarded to Strata for the audit trail and training data pipeline.

88 Tests and What They Caught

The test suite covers 16 test classes across every layer:

  • Directory store — CRUD, upsert, subtree search, filter matching, changelog recording, DIT skeleton creation, case-insensitive attribute handling
  • Schema conversion — User/Group/Role/Tenant/Service/Device bidirectional conversion, the enum value trap
  • RBAC backend — RBAC interface compliance, load/save/delete operations
  • Agent sync — YAML parsing, multi-source merge, A2A metadata
  • Service sync — Registry ingest, certificate binding overlay, SMTP config, distribution lists, Strata storage classification
  • Tenant sync — Default tenants, sub-OU creation, custom tenants
  • LDAP BER — Integer/string/boolean/sequence encoding, short/medium/long length encoding, sequence decoding
  • LDAP Session — Anonymous bind, admin bind (bcrypt), invalid bind rejection, search result formatting
  • LDAP Protocol — asyncio.Protocol data handling, UNBIND processing
  • External LDAP client — Config parsing, defaults, AD user entry building
  • HTTP client — Client creation, availability check behavior
  • RBAC Integration — End-to-end: write user via directory backend, search via directory store, verify agent type filtering, full sync across all object types

The most valuable test? The RBAC integration test that caught a guard regression. Without it, a user persistence test would load users from the production directory database instead of the test's isolated database — a failure mode that only manifests in CI where the directory database has real data in it.

What Comes Next

The directory is read-only over LDAP today. Write support (LDAP ADD/MODIFY/DELETE with ACL enforcement) is on the roadmap for enterprise deployments where external tools need to manage users directly.

SAML and OIDC federation are planned as attribute overlays on the directory — aitherSAMLEntityId, aitherOIDCClientId — so that AitherIdentity can look up federation config from the directory instead of its own local state.

And the big one: cross-node directory replication. When AitherOS runs across multiple machines in a mesh, each node needs a local directory replica for sub-millisecond RBAC checks. CRDTs (Conflict-free Replicated Data Types) on the changelog are the leading approach — each node maintains a local SQLite replica, and changelog entries are exchanged and merged using vector clocks.

For now, one tree, one database, one source of truth. Every identity object in AitherOS — human, agent, service, tenant, device — lives in dc=aither,dc=os. If you can't find it in the tree, it doesn't exist.