Scaling Multi-Tenant Concurrency in a Single AI Platform: How AitherOS Handles It
When you run a multi-tenant AI platform — where dozens of tenants, hundreds of workspaces, and thousands of concurrent users share the same inference backends, context pipelines, and memory stores — every architectural decision either compounds into efficiency or compounds into chaos.
This post documents how AitherOS handles multi-tenant concurrency at every layer of the stack, from request scoping to token-level deduplication. These aren't theoretical patterns. They're in production code, handling real concurrent load across isolated tenants on shared infrastructure.
The Problem Space
A single AitherOS instance serves multiple tenants. Each tenant has multiple workspaces. Each workspace has multiple users. Each user has multiple concurrent sessions. Every request needs:
- Isolation — Tenant A must never see Tenant B's data, context, or conversation history
- Efficiency — When users in the same workspace reference the same codebase, we shouldn't embed it twice or tokenize the same system prompt on every request
- Fairness — One tenant's burst of 100 requests shouldn't starve another tenant's single request
- Correctness — Concurrent writes to shared state (memory graphs, context caches, session directories) must not corrupt data
These requirements are in tension. Isolation wants everything separate. Efficiency wants everything shared. Here's how we resolve it.
Layer 1: Five-Level Request Scoping
Every request in AitherOS carries a RequestScope — a hierarchical identity that flows through the entire call chain via Python's contextvars:
Platform → Tenant → Workspace → Project → User/Session
This isn't just metadata. It's the access control boundary. The RequestScope is set at the API gateway (Genesis) and propagated automatically through async context. Every downstream service — context assembly, memory retrieval, LLM dispatch — reads scope from context, never from request parameters.
from dataclasses import dataclass
from typing import Optional

@dataclass
class RequestScope:
    tenant: "TenantContext"   # tenant identity; assumed to expose tenant_id
    workspace_id: Optional[str] = None
    project_id: Optional[str] = None
    user_id: Optional[str] = None

    def cache_namespace(self, level: str) -> str:
        """Deterministic cache key at any granularity."""
        # "tenant"    → "acme"
        # "workspace" → "acme:ws-frontend"
        # "user"      → "acme:ws-frontend:user-42"
        parts = {"tenant": [self.tenant.tenant_id],
                 "workspace": [self.tenant.tenant_id, self.workspace_id],
                 "user": [self.tenant.tenant_id, self.workspace_id, self.user_id]}[level]
        return ":".join(p for p in parts if p)
The cache_namespace() method is the foundation for every cache in the system. It produces deterministic, hierarchical keys that can be scoped at any level.
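For example, with illustrative values (tenant_ctx stands in for a real TenantContext whose ID is "acme"):

scope = RequestScope(tenant=tenant_ctx, workspace_id="ws-frontend", user_id="user-42")
scope.cache_namespace("tenant")      # "acme"
scope.cache_namespace("workspace")   # "acme:ws-frontend"
scope.cache_namespace("user")        # "acme:ws-frontend:user-42"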
Why ContextVars, Not Thread-Locals?
Because we're async-first. Python's threading.local() doesn't propagate across await boundaries. contextvars.ContextVar does. When a request handler awaits a database call, the context survives. When we spawn concurrent tasks for parallel context assembly, each task inherits the correct scope.
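Here's a minimal, self-contained sketch of that behavior. The request_scope variable and handler below are illustrative, not the actual Genesis code, but the propagation semantics are exactly what we rely on:

import asyncio
from contextvars import ContextVar

# Illustrative stand-in for the real RequestScope context variable.
request_scope: ContextVar[str] = ContextVar("request_scope")

async def assemble_stage(name: str) -> str:
    # Each spawned task inherits a copy of the caller's context.
    return f"{name} ran under scope {request_scope.get()}"

async def handle_request(scope: str) -> None:
    request_scope.set(scope)          # set once at the gateway
    await asyncio.sleep(0)            # survives await boundaries
    results = await asyncio.gather(   # and survives task fan-out
        assemble_stage("rag"),
        assemble_stage("memory"),
    )
    print(results)

async def main() -> None:
    # Two concurrent requests never see each other's scope.
    await asyncio.gather(handle_request("acme:ws-frontend"),
                         handle_request("globex:ws-api"))

asyncio.run(main())

Run it and each handler's stages report that handler's own scope, even though everything executes concurrently.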
Layer 2: Scope-Aware Context Cache (L1 — In-Memory)
AitherOS assembles context through a 10-stage pipeline (ContextPipeline). The output is a set of scored, typed chunks — persona definitions, conversation history, RAG results, memory graph entries, file contents. These chunks live in an ActiveContextCache with surgical eviction.
The problem: in a multi-tenant system, whose chunks are whose?
Our solution separates chunks into shared and private scopes:
- Shared scope (workspace level): RAG results, codebase knowledge, project documentation. If User A and User B are in the same workspace, they share these chunks.
- Private scope (user level): Conversation history, affect state, personal memories. These never leak between users.
# Source classification — which chunks are private?
_PRIVATE_SOURCES = {"spirit", "affect", "conversation", "conversation_recall"}
def _is_private_chunk_source(source: str) -> bool:
    return source in _PRIVATE_SOURCES
When a chunk is injected into the cache, it's automatically classified and attached to the appropriate scope via ContextVars:
from contextvars import ContextVar

_CURRENT_SHARED_CACHE_SCOPE = ContextVar('shared_cache_scope')
_CURRENT_PRIVATE_CACHE_SCOPE = ContextVar('private_cache_scope')
Lookup order: private scope → shared scope → session fallback → global cache. A user always sees their own conversation history overlaid on workspace-shared context.
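In pseudocode, the overlay is a scope-ordered merge. The helper below is illustrative rather than the actual ActiveContextCache API:

def resolve_chunks(private: dict, shared: dict, session: dict, global_: dict) -> dict:
    """Earlier (more private) scopes win; later scopes only fill in what's missing."""
    merged = {}
    for layer in (private, shared, session, global_):
        for chunk_id, chunk in layer.items():
            merged.setdefault(chunk_id, chunk)
    return merged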
Layer 3: Redis-Backed L2/L3 Cache
The L1 cache is per-process. When a pod restarts or a request lands on a different worker, the cache is cold. For a multi-tenant system with dozens of concurrent users, this means re-assembling identical context constantly.
Enter RedisContextCache — a three-tier Redis-backed cache with graceful degradation:
| Tier | Scope | Key Pattern | TTL | Purpose |
|---|---|---|---|---|
| L2 | Workspace | ctx:{tenant}:{workspace}:{hash} | 1 hour | Cross-session chunk sharing |
| L3 | Tenant | emb:{tenant}:{hash} | 7 days | Cross-workspace embedding reuse |
| Global | Platform | tkn:{model}:{hash} | 7 days | Token count deduplication |
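The key patterns in the table translate directly into small key builders. A sketch, assuming the hash is a SHA-256 of the content being cached:

import hashlib

def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def l2_key(tenant: str, workspace: str, content: str) -> str:
    return f"ctx:{tenant}:{workspace}:{_sha256(content)}"   # 1 hour TTL

def l3_key(tenant: str, content: str) -> str:
    return f"emb:{tenant}:{_sha256(content)}"               # 7 day TTL

def token_key(model: str, content: str) -> str:
    return f"tkn:{model}:{_sha256(content)}"                # 7 day TTL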
Write-Through Architecture
When a chunk is injected into the L1 cache, it's simultaneously written to Redis L2 (fire-and-forget async). On L1 miss, we check Redis L2 and warm the L1 cache with the result:
# On inject: write-through to Redis
def inject(self, chunk, scope_id=None):
    # ... L1 insert ...
    self._write_through_redis_l2(scope_id, chunk)

# On read: check Redis on L1 miss
def _get_scoped_chunks(self, scope_id, **kw):
    chunks = self._cache.get_scope_chunks(scope_id)       # L1
    if not chunks:
        chunks = self._fetch_redis_l2_chunks(scope_id)    # L2
    return chunks
Graceful Degradation
Redis is never a hard dependency. The entire RedisContextCache falls back to an in-memory bounded LRU cache when Redis is unavailable. The fallback registers with our DegradationRegistry, so operators see it in the health dashboard but the system keeps running.
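A minimal sketch of that fallback pattern, using a bounded in-memory LRU and a catch-all around the Redis call (the DegradationRegistry reporting is only hinted at in a comment; its real API isn't shown here):

from collections import OrderedDict

class FallbackCache:
    """Bounded in-memory LRU used when Redis is unreachable."""
    def __init__(self, max_entries: int = 4096):
        self._data: OrderedDict[str, bytes] = OrderedDict()
        self._max = max_entries

    def get(self, key: str):
        if key in self._data:
            self._data.move_to_end(key)
            return self._data[key]
        return None

    def set(self, key: str, value: bytes) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self._max:
            self._data.popitem(last=False)   # evict least recently used

async def cache_get(redis_client, fallback: FallbackCache, key: str):
    """Try Redis first; on any connection error, degrade to the local LRU."""
    try:
        return await redis_client.get(key)
    except Exception:
        # This is where the degradation would be reported to the health
        # dashboard; here we simply fall back and keep serving.
        return fallback.get(key)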
Layer 4: Token Count Caching
LLM token counting is surprisingly expensive at scale. Every context assembly calls count_tokens() multiple times — to measure chunks, to clamp context windows, to estimate costs. The same system prompt gets re-tokenized on every request.
We cache token counts using content-addressed SHA-256 keys with an 8K-entry bounded LRU. This single optimization eliminates ~60% of tokenizer calls in practice, because system prompts, persona blocks, and workspace knowledge are highly repetitive across requests.
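A sketch of what that looks like, with a stand-in count_tokens callable for whatever tokenizer backs a given model:

import hashlib
import threading
from collections import OrderedDict
from typing import Callable

class TokenCountCache:
    """SHA-256-keyed, bounded LRU in front of an expensive tokenizer."""
    def __init__(self, count_tokens: Callable[[str], int], max_entries: int = 8192):
        self._count = count_tokens
        self._lru: OrderedDict[str, int] = OrderedDict()
        self._max = max_entries
        self._lock = threading.Lock()     # sync path, so a threading.Lock

    def count(self, text: str) -> int:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        with self._lock:
            if key in self._lru:
                self._lru.move_to_end(key)
                return self._lru[key]
        n = self._count(text)             # tokenize outside the lock
        with self._lock:
            self._lru[key] = n
            self._lru.move_to_end(key)
            if len(self._lru) > self._max:
                self._lru.popitem(last=False)
        return n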
Layer 5: Deterministic Prompt Prefix
LLM providers (OpenAI, Anthropic) cache prompt prefixes — if two requests share the same prefix, the provider can skip re-processing it. But this only works if the prefix is byte-identical across requests.
We enforce deterministic assembly of the system prompt by ordering sections from most-stable to least-stable:
system_sections = [
    effective_system,           # Base system prompt (stable)
    live_system_context,        # Workspace context (stable per workspace)
    gathered_system_context,    # Gathered knowledge (changes rarely)
    all_context,                # Dynamic context (changes per request)
]
For users in the same workspace with the same persona, the first two sections are identical — that's typically 2-4K tokens of free caching from the provider.
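Determinism here is mostly about the join: same sections, same order, same separator, same bytes. A minimal sketch (the double-newline separator is an assumption, not necessarily what AitherOS uses):

def build_system_prompt(sections: list) -> str:
    # Fixed order plus a fixed separator keeps the assembled prefix
    # byte-identical across requests that share the same stable sections.
    return "\n\n".join(section for section in sections if section)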
Layer 6: Workspace-Safe Memory Isolation
The ComposableMemoryGraph is our unified memory system. Each memory entry belongs to a MemoryScope with a namespace that includes tenant_id:workspace_id:agent_id:project_id:user_id. This ensures retrieval is always scoped correctly.
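As a rough sketch of the key construction (the real MemoryScope carries more fields and behavior than shown here):

from dataclasses import dataclass

@dataclass(frozen=True)
class MemoryScope:
    tenant_id: str
    workspace_id: str
    agent_id: str
    project_id: str
    user_id: str

    @property
    def namespace(self) -> str:
        # Joined in a fixed order so every retrieval query is scoped the same way.
        return ":".join((self.tenant_id, self.workspace_id, self.agent_id,
                         self.project_id, self.user_id))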
Embedding Dedup Across Workspaces
When the same file content exists in multiple workspaces (common with shared codebases), we don't embed it twice. The embed stage checks three locations:
- In-memory LRU (keyed by tenant_namespace:content_hash) — fastest
- Redis L3 (keyed by emb:{tenant}:{content_hash}) — survives restarts
- Embedding API — only on double miss
For tenants with multiple workspaces sharing a monorepo, this eliminates 70-90% of embedding API calls.
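A sketch of the three-location lookup, assuming redis.asyncio for L3 and an async embed_api callable; the in-memory tier is shown as a plain dict rather than a bounded LRU:

import hashlib
import json

async def get_embedding(content: str, tenant: str, local_cache: dict,
                        redis_client, embed_api) -> list:
    """Three-location lookup: local cache, then Redis L3, then the embedding API."""
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    local_key = f"{tenant}:{content_hash}"

    # 1. In-memory cache (fastest)
    if local_key in local_cache:
        return local_cache[local_key]

    # 2. Redis L3, keyed per tenant so embeddings are shared across workspaces
    cached = await redis_client.get(f"emb:{tenant}:{content_hash}")
    if cached is not None:
        vector = json.loads(cached)
        local_cache[local_key] = vector
        return vector

    # 3. Embedding API, only on a double miss; write back to both tiers
    vector = await embed_api(content)
    local_cache[local_key] = vector
    await redis_client.set(f"emb:{tenant}:{content_hash}",
                           json.dumps(vector), ex=7 * 24 * 3600)
    return vector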
Layer 7: Cross-Tenant Content-Addressed Blob Dedup
The most aggressive dedup: when two different tenants reference identical content (common with popular open-source libraries or framework boilerplate), we store one copy:
blob:{sha256_of_content} → the content itself
blobref:{tenant}:{workspace}:{path} → sha256_hash
The blob layer is content-addressed and tenant-blind. The reference layer is tenant-aware and enforces access control. Redis's SET NX ensures atomic dedup — two concurrent writes of identical content result in one stored blob.
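A sketch of the write path, assuming redis.asyncio; the nx=True flag is what makes concurrent identical writes collapse into one stored blob:

import hashlib

async def store_blob(redis_client, tenant: str, workspace: str,
                     path: str, content: bytes) -> str:
    """Store one copy of identical content; references stay tenant-scoped."""
    content_hash = hashlib.sha256(content).hexdigest()

    # Content-addressed, tenant-blind blob: SET NX only writes if absent,
    # so two concurrent identical writes result in a single stored copy.
    await redis_client.set(f"blob:{content_hash}", content, nx=True)

    # Tenant-aware reference; access control is enforced at this layer.
    await redis_client.set(f"blobref:{tenant}:{workspace}:{path}", content_hash)
    return content_hash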
Layer 8: Concurrency Primitives
- asyncio.Lock for async-only paths (context assembly, Redis operations)
- threading.Lock for sync paths (token count cache, in-memory fallback)
- ContextVar for request-scoped state (no locking needed — each request has its own copy)
- Per-tenant rate limiting at the MicroScheduler layer (a simplified sketch follows this list)
- Circuit breakers on Redis, LLM backends, and embedding APIs with DegradationRegistry tracking
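For the rate-limiting item above, a simplified stand-in (not the MicroScheduler's actual implementation, just the shape of the idea): one semaphore per tenant bounds that tenant's in-flight work, so a burst from one tenant can't starve another's single request.

import asyncio
from collections import defaultdict

# One semaphore per tenant caps that tenant's in-flight requests.
_tenant_slots: dict = defaultdict(lambda: asyncio.Semaphore(10))

async def dispatch(tenant_id: str, call):
    async with _tenant_slots[tenant_id]:
        return await call()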
What This Means in Practice
For a typical multi-tenant deployment with 10 tenants, 50 workspaces, and 200 concurrent users:
| Optimization | Savings |
|---|---|
| L2 workspace chunk cache | ~40% reduction in context assembly time |
| L3 tenant embedding cache | ~70-90% reduction in embedding API calls |
| Token count caching | ~60% reduction in tokenizer invocations |
| Deterministic prompt prefix | ~30% reduction in provider inference cost |
| Cross-tenant blob dedup | ~20-50% reduction in Redis memory for shared codebases |
These compound. A request that would take 3 seconds to assemble context on a cold cache takes 200ms on a warm one.
The Design Principle
Every layer follows the same pattern:
- Content-address everything — SHA-256 hash of content is the universal key
- Scope at the right level — Don't scope tighter than necessary (wastes cache), don't scope looser than necessary (leaks data)
- Degrade gracefully — Every cache is optional. The system works without Redis, without provider caching, without embedding dedup. It just costs more.
- Write-through, read-through — L1 is always the hot path. L2/L3 are populated on write and consulted on miss. No cache invalidation storms.
Multi-tenancy isn't a feature you bolt on. It's a property of how you address, scope, and share data at every layer. Get the scoping right, and efficiency follows naturally.