Scaling Multi-Tenant Concurrency in a Single AI Platform: How AitherOS Handles It
When you run a multi-tenant AI platform — where dozens of tenants, hundreds of workspaces, and thousands of concurrent users share the same inference backends, context pipelines, and memory stores — every architectural decision either compounds into efficiency or compounds into chaos.
This post documents how AitherOS handles multi-tenant concurrency at every layer of the stack, from request scoping to token-level deduplication. These aren't theoretical patterns. They're in production code, handling real concurrent load across isolated tenants on shared infrastructure.
The Problem Space
A single AitherOS instance serves multiple tenants. Each tenant has multiple workspaces. Each workspace has multiple users. Each user has multiple concurrent sessions. Every request needs:
- Isolation — Tenant A must never see Tenant B's data, context, or conversation history
- Efficiency — When users in the same workspace reference the same codebase, we shouldn't embed it twice or tokenize the same system prompt on every request
- Fairness — One tenant's burst of 100 requests shouldn't starve another tenant's single request
- Correctness — Concurrent writes to shared state (memory graphs, context caches, session directories) must not corrupt data
These requirements are in tension. Isolation wants everything separate. Efficiency wants everything shared. Here's how we resolve it.
Layer 1: Five-Level Request Scoping
Every request in AitherOS carries a RequestScope — a hierarchical identity that flows through the entire call chain via Python's contextvars:
Platform → Tenant → Workspace → Project → User/Session
This isn't just metadata. It's the access control boundary. The RequestScope is set at the API gateway (Genesis) and propagated automatically through async context. Every downstream service — context assembly, memory retrieval, LLM dispatch — reads scope from context, never from request parameters.
from dataclasses import dataclass
from typing import Optional

@dataclass
class RequestScope:
    tenant: "TenantContext"   # tenant identity; assumed to expose tenant_id
    workspace_id: Optional[str] = None
    project_id: Optional[str] = None
    user_id: Optional[str] = None

    def cache_namespace(self, level: str) -> str:
        """Deterministic cache key at any granularity."""
        # "tenant"    → "acme"
        # "workspace" → "acme:ws-frontend"
        # "user"      → "acme:ws-frontend:user-42"
        parts = {"tenant": [self.tenant.tenant_id],
                 "workspace": [self.tenant.tenant_id, self.workspace_id],
                 "user": [self.tenant.tenant_id, self.workspace_id, self.user_id]}[level]
        return ":".join(p for p in parts if p)
The cache_namespace() method is the foundation for every cache in the system. It produces deterministic, hierarchical keys that can be scoped at any level.
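For example, with illustrative values (tenant_ctx stands in for a real TenantContext whose ID is "acme"):

scope = RequestScope(tenant=tenant_ctx, workspace_id="ws-frontend", user_id="user-42")
scope.cache_namespace("tenant")      # "acme"
scope.cache_namespace("workspace")   # "acme:ws-frontend"
scope.cache_namespace("user")        # "acme:ws-frontend:user-42"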
Why ContextVars, Not Thread-Locals?
Because we're async-first. Python's threading.local() doesn't propagate across await boundaries. contextvars.ContextVar does. When a request handler awaits a database call, the context survives. When we spawn concurrent tasks for parallel context assembly, each task inherits the correct scope.
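Here's a minimal, self-contained sketch of that behavior. The request_scope variable and handler below are illustrative, not the actual Genesis code, but the propagation semantics are exactly what we rely on:

import asyncio
from contextvars import ContextVar

# Illustrative stand-in for the real RequestScope context variable.
request_scope: ContextVar[str] = ContextVar("request_scope")

async def assemble_stage(name: str) -> str:
    # Each spawned task inherits a copy of the caller's context.
    return f"{name} ran under scope {request_scope.get()}"

async def handle_request(scope: str) -> None:
    request_scope.set(scope)          # set once at the gateway
    await asyncio.sleep(0)            # survives await boundaries
    results = await asyncio.gather(   # and survives task fan-out
        assemble_stage("rag"),
        assemble_stage("memory"),
    )
    print(results)

async def main() -> None:
    # Two concurrent requests never see each other's scope.
    await asyncio.gather(handle_request("acme:ws-frontend"),
                         handle_request("globex:ws-api"))

asyncio.run(main())

Run it and each handler's stages report that handler's own scope, even though everything executes concurrently.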
Layer 2: Scope-Aware Context Cache (L1 — In-Memory)
AitherOS assembles context through a 10-stage pipeline (ContextPipeline). The output is a set of scored, typed chunks — persona definitions, conversation history, RAG results, memory graph entries, file contents. These chunks live in an ActiveContextCache with surgical eviction.
The problem: in a multi-tenant system, whose chunks are whose?
Our solution separates chunks into shared and private scopes:
- Shared scope (workspace level): RAG results, codebase knowledge, project documentation. If User A and User B are in the same workspace, they share these chunks.
- Private scope (user level): Conversation history, affect state, personal memories. These never leak between users.
# Source classification — which chunks are private?
_PRIVATE_SOURCES = {"spirit", "affect", "conversation", "conversation_recall"}
def _is_private_chunk_source(source: str) -> bool:
    return source in _PRIVATE_SOURCES
When a chunk is injected into the cache, it's automatically classified and attached to the appropriate scope via ContextVars:
from contextvars import ContextVar

_CURRENT_SHARED_CACHE_SCOPE = ContextVar('shared_cache_scope')
_CURRENT_PRIVATE_CACHE_SCOPE = ContextVar('private_cache_scope')
Lookup order: private scope → shared scope → session fallback → global cache. A user always sees their own conversation history overlaid on workspace-shared context.
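In pseudocode, the overlay is a scope-ordered merge. The helper below is illustrative rather than the actual ActiveContextCache API:

def resolve_chunks(private: dict, shared: dict, session: dict, global_: dict) -> dict:
    """Earlier (more private) scopes win; later scopes only fill in what's missing."""
    merged = {}
    for layer in (private, shared, session, global_):
        for chunk_id, chunk in layer.items():
            merged.setdefault(chunk_id, chunk)
    return merged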
Layer 3: Redis-Backed L2/L3 Cache
The L1 cache is per-process. When a pod restarts or a request lands on a different worker, the cache is cold. For a multi-tenant system with dozens of concurrent users, this means re-assembling identical context constantly.
Enter RedisContextCache — a three-tier Redis-backed cache with graceful degradation:
| Tier | Scope | Key Pattern | TTL | Purpose |
|---|---|---|---|---|
| L2 | Workspace | ctx:{tenant}:{workspace}:{hash} | 1 hour | Cross-session chunk sharing |
| L3 | Tenant | emb:{tenant}:{hash} | 7 days | Cross-workspace embedding reuse |
| Global | Platform | tkn:{model}:{hash} | 7 days | Token count deduplication |
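The key patterns in the table translate directly into small key builders. A sketch, assuming the hash is a SHA-256 of the content being cached:

import hashlib

def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def l2_key(tenant: str, workspace: str, content: str) -> str:
    return f"ctx:{tenant}:{workspace}:{_sha256(content)}"   # 1 hour TTL

def l3_key(tenant: str, content: str) -> str:
    return f"emb:{tenant}:{_sha256(content)}"               # 7 day TTL

def token_key(model: str, content: str) -> str:
    return f"tkn:{model}:{_sha256(content)}"                # 7 day TTL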
Write-Through Architecture
When a chunk is injected into the L1 cache, it's simultaneously written to Redis L2 (fire-and-forget async). On L1 miss, we check Redis L2 and warm the L1 cache with the result:
# On inject: write-through to Redis
def inject(self, chunk, scope_id=None):
    # ... L1 insert ...
    self._write_through_redis_l2(scope_id, chunk)

# On read: check Redis on L1 miss
def _get_scoped_chunks(self, scope_id, **kw):
    chunks = self._cache.get_scope_chunks(scope_id)       # L1
    if not chunks:
        chunks = self._fetch_redis_l2_chunks(scope_id)    # L2
    return chunks
Graceful Degradation
Redis is never a hard dependency. The entire RedisContextCache falls back to an in-memory bounded LRU cache when Redis is unavailable. The fallback registers with our DegradationRegistry, so operators see it in the health dashboard but the system keeps running.
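A minimal sketch of that fallback pattern, using a bounded in-memory LRU and a catch-all around the Redis call (the DegradationRegistry reporting is only hinted at in a comment; its real API isn't shown here):

from collections import OrderedDict

class FallbackCache:
    """Bounded in-memory LRU used when Redis is unreachable."""
    def __init__(self, max_entries: int = 4096):
        self._data: OrderedDict[str, bytes] = OrderedDict()
        self._max = max_entries

    def get(self, key: str):
        if key in self._data:
            self._data.move_to_end(key)
            return self._data[key]
        return None

    def set(self, key: str, value: bytes) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self._max:
            self._data.popitem(last=False)   # evict least recently used

async def cache_get(redis_client, fallback: FallbackCache, key: str):
    """Try Redis first; on any connection error, degrade to the local LRU."""
    try:
        return await redis_client.get(key)
    except Exception:
        # This is where the degradation would be reported to the health
        # dashboard; here we simply fall back and keep serving.
        return fallback.get(key)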
Layer 4: Token Count Caching
LLM token counting is surprisingly expensive at scale. Every context assembly calls count_tokens() multiple times — to measure chunks, to clamp context windows, to estimate costs. The same system prompt gets re-tokenized on every request.
We cache token counts using content-addressed SHA-256 keys with an 8K-entry bounded LRU. This single optimization eliminates ~60% of tokenizer calls in practice, because system prompts, persona blocks, and workspace knowledge are highly repetitive across requests.
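A sketch of what that looks like, with a stand-in count_tokens callable for whatever tokenizer backs a given model:

import hashlib
import threading
from collections import OrderedDict
from typing import Callable

class TokenCountCache:
    """SHA-256-keyed, bounded LRU in front of an expensive tokenizer."""
    def __init__(self, count_tokens: Callable[[str], int], max_entries: int = 8192):
        self._count = count_tokens
        self._lru: OrderedDict[str, int] = OrderedDict()
        self._max = max_entries
        self._lock = threading.Lock()     # sync path, so a threading.Lock

    def count(self, text: str) -> int:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        with self._lock:
            if key in self._lru:
                self._lru.move_to_end(key)
                return self._lru[key]
        n = self._count(text)             # tokenize outside the lock
        with self._lock:
            self._lru[key] = n
            self._lru.move_to_end(key)
            if len(self._lru) > self._max:
                self._lru.popitem(last=False)
        return n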
Layer 5: Deterministic Prompt Prefix
LLM providers (OpenAI, Anthropic) cache prompt prefixes — if two requests share the same prefix, the provider can skip re-processing it. But this only works if the prefix is byte-identical across requests.
We enforce deterministic assembly of the system prompt by ordering sections from most-stable to least-stable:
system_sections = [
    effective_system,           # Base system prompt (stable)
    live_system_context,        # Workspace context (stable per workspace)
    gathered_system_context,    # Gathered knowledge (changes rarely)
    all_context,                # Dynamic context (changes per request)
]
For users in the same workspace with the same persona, the first two sections are identical — that's typically 2-4K tokens of free caching from the provider.
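Determinism here is mostly about the join: same sections, same order, same separator, same bytes. A minimal sketch (the double-newline separator is an assumption, not necessarily what AitherOS uses):

def build_system_prompt(sections: list) -> str:
    # Fixed order plus a fixed separator keeps the assembled prefix
    # byte-identical across requests that share the same stable sections.
    return "\n\n".join(section for section in sections if section)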
Layer 6: Workspace-Safe Memory Isolation
The ComposableMemoryGraph is our unified memory system. Each memory entry belongs to a MemoryScope with a namespace that includes tenant_id:workspace_id:agent_id:project_id:user_id. This ensures retrieval is always scoped correctly.
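As a rough sketch of the key construction (the real MemoryScope carries more fields and behavior than shown here):

from dataclasses import dataclass

@dataclass(frozen=True)
class MemoryScope:
    tenant_id: str
    workspace_id: str
    agent_id: str
    project_id: str
    user_id: str

    @property
    def namespace(self) -> str:
        # Joined in a fixed order so every retrieval query is scoped the same way.
        return ":".join((self.tenant_id, self.workspace_id, self.agent_id,
                         self.project_id, self.user_id))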
Embedding Dedup Across Workspaces
When the same file content exists in multiple workspaces (common with shared codebases), we don't embed it twice. The embed stage checks three locations:
- In-memory LRU (keyed by tenant_namespace:content_hash) — fastest
- Redis L3 (keyed by emb:{tenant}:{content_hash}) — survives restarts
- Embedding API — only on double miss
For tenants with multiple workspaces sharing a monorepo, this eliminates 70-90% of embedding API calls.
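A sketch of the three-location lookup, assuming redis.asyncio for L3 and an async embed_api callable; the in-memory tier is shown as a plain dict rather than a bounded LRU:

import hashlib
import json

async def get_embedding(content: str, tenant: str, local_cache: dict,
                        redis_client, embed_api) -> list:
    """Three-location lookup: local cache, then Redis L3, then the embedding API."""
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    local_key = f"{tenant}:{content_hash}"

    # 1. In-memory cache (fastest)
    if local_key in local_cache:
        return local_cache[local_key]

    # 2. Redis L3, keyed per tenant so embeddings are shared across workspaces
    cached = await redis_client.get(f"emb:{tenant}:{content_hash}")
    if cached is not None:
        vector = json.loads(cached)
        local_cache[local_key] = vector
        return vector

    # 3. Embedding API, only on a double miss; write back to both tiers
    vector = await embed_api(content)
    local_cache[local_key] = vector
    await redis_client.set(f"emb:{tenant}:{content_hash}",
                           json.dumps(vector), ex=7 * 24 * 3600)
    return vector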
Layer 7: Cross-Tenant Content-Addressed Blob Dedup
The most aggressive dedup: when two different tenants reference identical content (common with popular open-source libraries or framework boilerplate), we store one copy:
blob:{sha256_of_content} → the content itself
blobref:{tenant}:{workspace}:{path} → sha256_hash
The blob layer is content-addressed and tenant-blind. The reference layer is tenant-aware and enforces access control. Redis's SET NX ensures atomic dedup — two concurrent writes of identical content result in one stored blob.
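A sketch of the write path, assuming redis.asyncio; the nx=True flag is what makes concurrent identical writes collapse into one stored blob:

import hashlib

async def store_blob(redis_client, tenant: str, workspace: str,
                     path: str, content: bytes) -> str:
    """Store one copy of identical content; references stay tenant-scoped."""
    content_hash = hashlib.sha256(content).hexdigest()

    # Content-addressed, tenant-blind blob: SET NX only writes if absent,
    # so two concurrent identical writes result in a single stored copy.
    await redis_client.set(f"blob:{content_hash}", content, nx=True)

    # Tenant-aware reference; access control is enforced at this layer.
    await redis_client.set(f"blobref:{tenant}:{workspace}:{path}", content_hash)
    return content_hash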
Layer 8: Concurrency Primitives
- asyncio.Lock for async-only paths (context assembly, Redis operations)
- threading.Lock for sync paths (token count cache, in-memory fallback)
- ContextVar for request-scoped state (no locking needed — each request has its own copy)
- Per-tenant rate limiting at the MicroScheduler layer (a simplified sketch follows this list)
- Circuit breakers on Redis, LLM backends, and embedding APIs with DegradationRegistry tracking
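For the rate-limiting item above, a simplified stand-in (not the MicroScheduler's actual implementation, just the shape of the idea): one semaphore per tenant bounds that tenant's in-flight work, so a burst from one tenant can't starve another's single request.

import asyncio
from collections import defaultdict

# One semaphore per tenant caps that tenant's in-flight requests.
_tenant_slots: dict = defaultdict(lambda: asyncio.Semaphore(10))

async def dispatch(tenant_id: str, call):
    async with _tenant_slots[tenant_id]:
        return await call()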
What This Means in Practice
For a typical multi-tenant deployment with 10 tenants, 50 workspaces, and 200 concurrent users:
| Optimization | Savings |
|---|---|
| L2 workspace chunk cache | ~40% reduction in context assembly time |
| L3 tenant embedding cache | ~70-90% reduction in embedding API calls |
| Token count caching | ~60% reduction in tokenizer invocations |
| Deterministic prompt prefix | ~30% reduction in provider inference cost |
| Cross-tenant blob dedup | ~20-50% reduction in Redis memory for shared codebases |
These compound. A request that would take 3 seconds to assemble context on a cold cache takes 200ms on a warm one.
The Design Principle
Every layer follows the same pattern:
- Content-address everything — SHA-256 hash of content is the universal key
- Scope at the right level — Don't scope tighter than necessary (wastes cache), don't scope looser than necessary (leaks data)
- Degrade gracefully — Every cache is optional. The system works without Redis, without provider caching, without embedding dedup. It just costs more.
- Write-through, read-through — L1 is always the hot path. L2/L3 are populated on write and consulted on miss. No cache invalidation storms.
Multi-tenancy isn't a feature you bolt on. It's a property of how you address, scope, and share data at every layer. Get the scoping right, and efficiency follows naturally.