
From Solo Machine to SaaS: How We Made an AI Operating System Multi-Tenant Without Rewriting It

March 18, 2026 · 13 min read · Aitherium

AitherOS started as one person's AI operating system. One machine. One user. Full trust. Every agent had access to everything. Memory was global. The orchestrator answered to whoever asked.

That model doesn't survive contact with a second user.

The moment you let someone else touch the system — a demo visitor, a paying customer, a partner's agent — you need answers to questions that never existed before. Can User B read User A's memories? Can a free-tier visitor trigger the reasoning model? Can an agent spawned by a Discord bot execute shell commands? Can a tenant's conversation context leak into another tenant's system prompt?

We had to retrofit isolation across 200+ microservices without breaking the single-user experience that made the platform good in the first place. Here's how.


The Trust Boundary Problem

In the original architecture, trust was binary. You were either on the machine or you weren't. Every request came from localhost. Every agent had full capabilities. The system prompt included all memories, all context, all tools. This was fine because "the user" meant one person.

Multi-user breaks this in every direction simultaneously:

  • Compute isolation: User A's casual chat shouldn't wait behind User B's deep reasoning request
  • Memory isolation: User A's private conversations can't appear in User B's context window
  • Tool isolation: A free-tier demo visitor shouldn't be able to spawn agent swarms or execute shell commands
  • Cost isolation: You can't let everyone trigger the reasoning model when it costs real GPU time or cloud API credits
  • Context isolation: The system prompt assembled for User A must contain zero fragments from User B's sessions

Most multi-tenant systems solve this at the database layer — row-level security, schema-per-tenant, separate databases. That handles storage. But an AI operating system has a much wider attack surface: the LLM itself. An LLM doesn't respect database ACLs. If User B's teaching gets embedded into the context window, the model will use it regardless of who asked the question.

We needed isolation at every layer — not just storage, but memory retrieval, context assembly, tool access, LLM routing, and event propagation.


Layer 1: Caller Identity — Who's Asking?

Every request entering AitherOS gets tagged with a CallerContext before anything else happens. This isn't a header you can set — it's computed from the request's origin:

CallerType (5 levels):

  • PLATFORM — Local machine, internal services, operator. Full trust. Only granted to requests from loopback or RFC 1918 private ranges (127.x, 10.x, 192.168.x, 172.16-31.x).
  • TENANT — Authenticated external user with a plan. Permissions scale with tier.
  • DEMO — Authenticated but on the demo/playground. Can chat and use agents, can't mutate.
  • PUBLIC — Unauthenticated external visitor. Read-only chat, explorer quotas.
  • ANONYMOUS — No identity at all. Minimal access.

Each CallerType carries six permission flags:

Permission        PLATFORM  TENANT    DEMO  PUBLIC  ANONYMOUS
can_agentic       yes       yes       yes   yes     no
can_forge         yes       by tier   yes   no      no
can_mutate        yes       by tier   no    no      no
can_execute       yes       by tier   no    no      no
can_generate      yes       by tier   no    no      no
can_multi_agent   yes       by tier   no    no      no
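The flag matrix above can be sketched as a static grant table. This is a hypothetical reconstruction, not the actual AitherOS code; the TENANT row is omitted because its "by tier" flags resolve against the plan at runtime:

```python
from dataclasses import dataclass
from enum import Enum

class CallerType(Enum):
    PLATFORM = "platform"
    TENANT = "tenant"
    DEMO = "demo"
    PUBLIC = "public"
    ANONYMOUS = "anonymous"

@dataclass(frozen=True)
class Permissions:
    can_agentic: bool = False
    can_forge: bool = False
    can_mutate: bool = False
    can_execute: bool = False
    can_generate: bool = False
    can_multi_agent: bool = False

# Static grants per caller type. TENANT is intentionally absent:
# its flags are resolved against the plan tier at request time.
GRANTS = {
    CallerType.PLATFORM: Permissions(True, True, True, True, True, True),
    CallerType.DEMO: Permissions(can_agentic=True, can_forge=True),
    CallerType.PUBLIC: Permissions(can_agentic=True),
    CallerType.ANONYMOUS: Permissions(),
}
```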

The critical design decision: fail-closed. If a request arrives from a non-local IP without a valid tenant header, it becomes PUBLIC — never PLATFORM. There is no code path where an external request gets full trust, regardless of what headers it sends. The Veil proxy sets X-Caller-Type, but the backend validates it against the source IP. Spoofing the header from an external IP gets you PUBLIC, not PLATFORM.
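A minimal sketch of that fail-closed resolution, using the stdlib `ipaddress` module. Function and parameter names are invented; the post doesn't show the real implementation:

```python
import ipaddress

# Loopback plus the RFC 1918 private ranges.
PRIVATE_NETS = [ipaddress.ip_network(n) for n in
                ("127.0.0.0/8", "10.0.0.0/8", "192.168.0.0/16", "172.16.0.0/12")]

def resolve_caller_type(source_ip: str, caller_header=None, tenant_valid: bool = False) -> str:
    """Fail-closed caller resolution. The X-Caller-Type header (caller_header)
    is deliberately ignored for trust decisions: only the source IP and a
    validated tenant credential matter."""
    ip = ipaddress.ip_address(source_ip)
    if any(ip in net for net in PRIVATE_NETS):
        return "PLATFORM"          # local/private origin: full trust
    if tenant_valid:
        return "TENANT"
    # External IP, no valid tenant -> PUBLIC, whatever the header claims.
    return "PUBLIC"
```

Spoofing `X-Caller-Type: PLATFORM` from outside simply never enters the trust decision.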

This CallerContext propagates through the entire async request chain via Python's ContextVar. Every function in the call stack — ChatEngine, AgentForge, ActionExecutor, LLMGateway — can read the caller identity without it being passed as a parameter. It's ambient, unforgeable (within the process), and automatic.


Layer 2: Tenant Context — Scoped Everything

Once we know who is asking, we need to know which tenant space they belong to. The TenantContext carries:

  • tenant_id: Format tnt_{8-hex-chars} — validated to prevent injection
  • plan_tier: EXPLORER through ENTERPRISE (7 tiers)
  • effort_cap: Maximum LLM effort level this tenant can trigger
  • Isolation prefixes: Computed properties that scope every subsystem

The prefixes are where isolation becomes structural:

tenant.strata_prefix  → "tenants/growth-co/"      # Storage paths
tenant.flux_prefix    → "growth-co:"               # Event bus channels  
tenant.secrets_namespace → "tenants/growth-co"     # Secret vault scope
tenant.db_name        → "aither_tnt_a1b2c3d4"     # PostgreSQL database

Every subsystem that stores or retrieves data uses these prefixes. Strata stores chunks under tenants/{slug}/chunks/. FluxBus publishes events to {slug}:sessions:{id}. AitherSecrets scopes vault access to tenants/{slug}. There's no global namespace where tenant data mingles — it's prefix-isolated at the storage layer.
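The computed prefix properties might look something like this. A sketch under assumptions: only the four prefixes and the id/slug formats are quoted from the post, everything else is illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TenantContext:
    tenant_id: str   # validated format: tnt_{8-hex-chars}
    slug: str        # human-readable namespace, e.g. "growth-co"

    @property
    def strata_prefix(self) -> str:
        return f"tenants/{self.slug}/"        # storage paths

    @property
    def flux_prefix(self) -> str:
        return f"{self.slug}:"                # event bus channels

    @property
    def secrets_namespace(self) -> str:
        return f"tenants/{self.slug}"         # secret vault scope

    @property
    def db_name(self) -> str:
        return f"aither_{self.tenant_id}"     # PostgreSQL database
```

Because the prefixes are computed properties rather than stored strings, a subsystem can't accidentally persist a stale or mismatched namespace.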

The public tenant is a well-known singleton: tenant_id="public", plan_tier=EXPLORER, with the tightest quotas. Every unauthenticated visitor, every demo user, every Discord bot without explicit tenant credentials lands here automatically. Their data goes into tenants/public/ and stays there.


Layer 3: Memory Isolation — Your Agent Doesn't Remember Their Conversations

This is the layer that matters most for an AI system, and the one most multi-tenant platforms get wrong.

The memory stack in AitherOS has four tiers:

  1. Working Memory — Per-session short-term context (what happened in this conversation)
  2. Spirit Memory — Long-term teachings, insights, identity (what the agent has learned)
  3. Knowledge Graph — Structured knowledge nodes with embeddings (what the agent knows)
  4. Conversation Store — Historical chat sessions (what was said before)

Every node, entry, and record in all four tiers carries a tenant_id field. Every query filters by it.

When the Knowledge Graph runs hybrid_query() — combining keyword search with vector similarity — it filters results by the caller's tenant before ranking. A brilliantly relevant memory from Tenant A will never surface in Tenant B's results, even if the embeddings are nearly identical.

The Conversation Store uses tenant-scoped cache keys: {tenant_id}:{conversation_id}. Two tenants can have conversations with the same ID without collision or leakage.

Spirit Memory (long-term teachings) is the trickiest. When the platform teaches the agent something — "always use TypeScript for frontend code" — that's a platform-level teaching visible to all. When a tenant teaches the agent something — "our API uses camelCase, not snake_case" — that's tenant-scoped. The teaching endpoints tag everything with the caller's tenant, and retrieval filters accordingly.

Why this matters: Without memory isolation, Tenant B's preferences bleed into Tenant A's experience. The agent starts suggesting camelCase to a Python shop because another tenant's teaching is in the embedding space. In a traditional app, this is a data leak. In an AI system, it's a behavioral leak — harder to detect, harder to debug, and potentially more damaging.


Layer 4: Context Pipeline Isolation — What Goes Into the System Prompt

The context pipeline assembles the system prompt through 12 stages:

[AXIOMS] → [IDENTITY] → [SOUL] → [RULES] → [CAPABILITIES] → 
[CONTEXT] → [MEMORIES] → [AFFECT] → [TOOLS] → [HISTORY] → 
[REASONING] → [RESPONSE FORMAT]

Tenant isolation touches stages 6 through 10:

  • [CONTEXT]: Fetches relevant context from Strata — scoped by tenant.strata_prefix
  • [MEMORIES]: Retrieves Spirit teachings and insights — filtered by tenant_id
  • [AFFECT]: Emotional state and personality — tenant can have custom soul overlays
  • [TOOLS]: Tool availability — filtered by caller permissions and tier
  • [HISTORY]: Conversation history — tenant-scoped cache keys in ConversationStore

The first five stages ([AXIOMS] through [CAPABILITIES]) are platform-level — they define what AitherOS is. These are the same for every tenant. The remaining stages are tenant-scoped — they define what the agent knows and can do for this specific user.

The result: two users chatting simultaneously get system prompts that share the same identity and rules but contain completely different memories, context, tool sets, and conversation history.
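The platform/tenant split in assembly can be sketched as follows. Block contents here are placeholder strings; the real pipeline stages are obviously richer than dictionary lookups:

```python
PLATFORM_STAGES = ["AXIOMS", "IDENTITY", "SOUL", "RULES", "CAPABILITIES"]
TENANT_STAGES = ["CONTEXT", "MEMORIES", "AFFECT", "TOOLS", "HISTORY"]

def assemble_prompt(platform_blocks, tenant_blocks, tenant_id):
    """Shared identity + tenant-scoped knowledge, in stage order."""
    parts = [platform_blocks[s] for s in PLATFORM_STAGES]           # same for everyone
    parts += [tenant_blocks[tenant_id][s] for s in TENANT_STAGES]   # scoped per tenant
    parts += [platform_blocks["REASONING"], platform_blocks["RESPONSE FORMAT"]]
    return "\n\n".join(parts)
```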


Layer 5: Compute Isolation — Not Everyone Gets the Good GPU

This is where multi-tenancy gets expensive. LLM inference costs real money — either GPU time or cloud API credits. You can't give everyone unlimited access to the reasoning model.

The effort system maps user intent to compute cost:

Effort Level   Model Tier                       Example
1-2            Fast (orchestrator)              "Hi", "What time is it?"
3-6            Balanced (orchestrator)          Code questions, explanations, analysis
7-8            Deep (reasoning tool)            Architecture review, complex debugging
9-10           Ultra (reasoning + multi-agent)  Full codebase audit, swarm coding

Each plan tier has an effort cap — the maximum effort level the system will execute:

Plan Tier            Effort Cap  What It Means
EXPLORER (free)      6           Chat with the orchestrator only. No reasoning model.
BUILDER              6           Same. Good for casual use.
GROWTH               8           Unlocks deep reasoning (effort 7-8).
PROFESSIONAL         10          Full reasoning + multi-agent swarms.
ENTERPRISE           uncapped    Everything, no limits.
PLATFORM (operator)  uncapped    Everything, no limits.

When the IntentEngine classifies a message at effort 8 but the tenant's cap is 6, the EffortScaler clamps it down. The user gets a good answer from the orchestrator instead of a great answer from the reasoning model. They're not told "upgrade to unlock this" — the system just does its best within the budget. The experience degrades gracefully, not with a paywall error.
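The clamping rule itself is tiny. A sketch using the caps from the table above — the `None` sentinel for uncapped tiers is an assumption of this sketch:

```python
# Effort caps per plan tier; None means uncapped.
EFFORT_CAPS = {"EXPLORER": 6, "BUILDER": 6, "GROWTH": 8,
               "PROFESSIONAL": 10, "ENTERPRISE": None, "PLATFORM": None}

def clamp_effort(requested: int, plan_tier: str) -> int:
    """Silently clamp the classified effort to the tenant's cap."""
    cap = EFFORT_CAPS[plan_tier]
    return requested if cap is None else min(requested, cap)
```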

This is also what makes the reasoning-as-a-tool architecture possible at the business level. Because reasoning is a network call to a remote GPU or cloud API, the cost is per-invocation, not per-seat. Capping effort at 6 for free-tier users means they never trigger a billable reasoning call. The orchestrator handles their requests locally at near-zero marginal cost.


Layer 6: Tool Access — What Can You Actually Do?

The MCP Gateway (port 8180) handles external tool access and enforces tier-based tool scoping. There are three tool tiers:

EXPLORER (free): Code search, memory queries, read-only git operations, stateless analysis. You can look at things. You can't change things.

PRO: Agent delegation, swarm coding, web search, advanced code graph tools. You can make agents do things.

ENTERPRISE: Full system access minus internal-only modules (boot orchestrator, service manager, dark factory). You can make agents do almost anything.

Rate limiting is per-tenant via token buckets. Each tier has different request/minute caps and burst allowances. An Explorer tenant burning through their rate limit gets 429s, not degraded service for everyone else.
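A classic token-bucket limiter along these lines would do the job — the rates, the class shape, and the per-tenant registry here are illustrative, not the actual implementation:

```python
import time

class TokenBucket:
    """Per-tenant rate limiter: `rate` tokens/sec refill, up to `burst` capacity."""
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False   # caller gets a 429

buckets = {}  # tenant_id -> TokenBucket, created lazily from the tier's limits
```

Because each tenant owns its own bucket, one tenant exhausting its quota never dips into anyone else's capacity.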

Internally, 30+ modules are categorized as INTERNAL_MODULES and are never exposed to any SaaS tier. These are the operational guts — service management, boot orchestration, infrastructure tools. Even Enterprise tenants don't see them. The operator console (PLATFORM caller from localhost) is the only path.


Layer 7: Pipeline Gates — Hard Stops in the Processing Chain

CallerContext checks happen at three critical chokepoints:

  1. ChatEngine.process() — First gate. Reads the caller identity, determines whether to allow agentic processing, image generation, or tool usage. A PUBLIC caller gets basic chat. A TENANT caller gets features matching their tier.

  2. AgentForge._dispatch() — Second gate. Checks can_forge before spawning a sub-agent. Every ForgeSpec carries tenant_id — sub-agents inherit the parent's tenant scope automatically. An agent spawned by Tenant A operates in Tenant A's namespace, even if the agent persona is shared.

  3. ActionExecutor.execute() — Third gate. Checks can_mutate and can_execute before any shell command, file write, or service call. This is the last line of defense. Even if an agent is convinced it should write a file (prompt injection, hallucination, whatever), the ActionExecutor checks the caller's permissions and blocks it.

These gates are in code, not in prompts. The LLM doesn't know they exist. It can decide to execute a shell command; the execution layer will silently refuse if the caller lacks can_execute. No error message back to the model — just a clean denial that the model interprets as "the tool returned no result."
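A sketch of that silent-denial behavior. The action kinds and return shape are invented for illustration; the point is that the permission check lives in the execution layer, not the prompt:

```python
def execute_action(action: dict, caller_perms: dict) -> dict:
    """Last-line gate: permission check in code, invisible to the model."""
    MUTATING = {"file_write", "service_call"}
    EXECUTING = {"shell"}
    kind = action["kind"]
    if kind in MUTATING and not caller_perms.get("can_mutate"):
        return {"result": None}   # silent denial: model sees an empty tool result
    if kind in EXECUTING and not caller_perms.get("can_execute"):
        return {"result": None}
    return {"result": f"ran {kind}"}   # placeholder for the real executor
```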


Layer 8: Event & Session Isolation

AitherOS is event-driven. FluxBus (Redis-backed pub/sub) carries real-time events — agent status updates, memory changes, context invalidations, health signals. In a multi-tenant system, events must be scoped.

Every FluxBus channel uses the tenant's flux prefix: {slug}:sessions:{session_id}, {slug}:agents:{agent_id}, {slug}:events:{type}. Tenant A subscribing to their event stream sees only their events. There's no broadcast channel where tenant data mingles (except platform-level operational events like health checks, which contain no tenant data).

Session isolation follows the same pattern. The ConversationStore keys sessions as {tenant_id}:{conversation_id}. The ContextPipeline reads session history through tenant-scoped queries. Two users having simultaneous conversations share the same orchestrator model, but their sessions are completely separate — different context windows, different memory retrievals, different conversation histories.

ContextVar propagation is the mechanism that makes this work without passing tenant_id through every function signature. When a request enters through FastAPI, the middleware sets _current_tenant on the ContextVar. Every async function in the call chain — through ChatEngine, through ContextPipeline, through MemoryGraph queries, through FluxBus publications — reads from the same ContextVar. It's automatic, it's unforgeable (within the process), and it survives across await boundaries.
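The mechanism can be demonstrated with stdlib `contextvars` and `asyncio` alone — each asyncio task copies the context at creation time, so concurrent requests can't see each other's tenant. The function names below stand in for the real middleware and call chain:

```python
import asyncio
from contextvars import ContextVar

_current_tenant: ContextVar[str] = ContextVar("_current_tenant", default="public")

async def memory_query() -> str:
    # Deep in the call chain: no tenant parameter, just the ambient ContextVar.
    return f"scoped to {_current_tenant.get()}"

async def handle_request(tenant_id: str) -> str:
    # What the FastAPI middleware would do: set the tenant for this task.
    _current_tenant.set(tenant_id)
    await asyncio.sleep(0)          # the value survives await boundaries
    return await memory_query()

async def main():
    # Two concurrent requests: each task sees only its own tenant.
    return await asyncio.gather(handle_request("tnt_a"), handle_request("tnt_b"))
```

Note the default: an untagged code path falls back to `"public"`, the fail-closed tenant, never to a privileged one.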


Layer 9: RBAC — The User-Level Layer

Above tenant isolation sits role-based access control. Users within a tenant can have different roles:

  • admin — Full access within the tenant's scope
  • developer — Can use agents, tools, deploy
  • viewer — Read-only access to dashboards and logs
  • agent — AI agent identity (yes, agents are RBAC users too)

Each role carries granular permissions (CREATE, READ, UPDATE, DELETE, EXECUTE, ADMIN) on specific resources. Groups aggregate roles. Users can belong to multiple groups.

The RBAC layer sits above CallerContext — it's checked first. If RBAC denies access, the request never reaches the pipeline gates. If RBAC allows it, CallerContext permissions still apply. It's two-layer security: RBAC controls what you can access, CallerContext controls what the system will do.

Agent identities in RBAC are important for auditability. When Demiurge (the code agent) modifies a file on behalf of Tenant A, the audit log shows: who (demiurge), what (file write), where (tenant A's workspace), when (timestamp), and why (task ID). The agent's RBAC user carries the tenant scope, so its actions are always attributed correctly.


What We Didn't Have to Rewrite

The remarkable thing about this architecture is what didn't change. The core processing pipeline — ChatEngine → ContextPipeline → UnifiedChatBackend → LLMGateway → LLMQueue → vLLM — is identical for single-user and multi-tenant. The orchestrator model doesn't know about tenants. The LLM gateway doesn't filter by tenant. The queue processor doesn't check permissions.

Isolation happens around the LLM, not inside it. By the time a request reaches the model, it has already been:

  • Authenticated and tagged with CallerContext
  • Scoped to a tenant namespace
  • Given a context assembled from tenant-isolated memory
  • Restricted to the tools its tier allows
  • Clamped to its plan's effort cap

The model sees a clean request with the right context and the right tools. It doesn't need to know it's serving multiple tenants. It just answers the question with the information it was given.

This is the same pattern as the reasoning-as-a-tool architecture: keep the core simple, enforce complexity at the boundaries. The orchestrator is a stateless router. All state — identity, permissions, memory, context — lives in the layers around it.


The Numbers

  • 5 caller types with 6 permission flags each
  • 7 plan tiers with effort caps and tool scoping
  • 4 memory tiers all tenant-isolated (working, spirit, knowledge graph, conversations)
  • 12-stage context pipeline with tenant filtering on 5 stages
  • 3 pipeline gates (ChatEngine, AgentForge, ActionExecutor)
  • Prefix-based isolation across Strata, FluxBus, Secrets, and PostgreSQL
  • ContextVar propagation — zero-parameter tenant flow through async chains
  • Fail-closed everywhere — invalid tenant → PUBLIC, never PLATFORM
  • 500+ tests covering tenant isolation, caller context, public tenant, graph isolation, and RBAC

The Tradeoff

There's a cost to all this isolation: complexity. Every new feature needs to ask "does this respect tenant scope?" Every memory query needs a tenant filter. Every event publication needs a prefix. Every tool registration needs a tier check.

But the alternative — bolting multi-tenancy onto a system that assumes single-user — is worse. We've seen what happens when AI platforms treat isolation as an afterthought: data leaks that manifest as behavioral changes, context contamination that's impossible to reproduce, cost overruns from unmetered compute access.

The single-user experience is preserved. If you run AitherOS on your own machine, you're PLATFORM caller type with uncapped effort and full tool access. The isolation layers are there but transparent — every check passes, every filter matches, every prefix resolves to the same global namespace. You don't pay for multi-tenancy you're not using.

But when the second user arrives — whether it's a demo visitor, a paying customer, or a partner's agent fleet — every wall is already standing. Their data goes into their namespace. Their compute stays within their budget. Their agent can't read your memories, execute your commands, or trigger your reasoning model.

The system was always multi-tenant. It just didn't know it yet.
