
Tags: architecture, kv-cache, context-pipeline, memory, gpu, inference, deep-dive, graph

Unified Graph, Real-Time Context, and the KV Cache Bridge That Connects Them

Name: AitherOS
Author: Aitherium

May 25, 2026 · 28 min read · Aitherium

The problem nobody talks about

Every LLM inference system operates in two worlds simultaneously. Up top, you have the application layer: context pipelines, tool results, memory retrieval, system prompts assembled from 12+ sources. Down below, you have the physical layer: KV cache blocks, VRAM pages, prefix cache hash tables, eviction queues.

These two worlds don't talk to each other.

Your context pipeline spends 200ms carefully scoring and ranking context chunks. It decides that this code snippet from CodeGraph is more important than that web search result. It assigns priorities. It evicts the lowest-scored chunks to stay under budget.

Then vLLM tokenizes the assembled prompt, allocates KV cache blocks, and proceeds to evict them by LRU when memory pressure hits. It has no idea that block 47 contains your system axioms (priority 5, never evict) and block 193 contains a stale web search result (priority 1, evict first). To vLLM, they're both just 16-token blocks in a queue.

We had this problem at scale in AitherOS. 34 graph systems. 15 entry points. 208 microservices. And a growing frustration that our carefully curated context was being treated as opaque bytes the moment it crossed the tokenizer boundary.

This post describes how we fixed it.

The three layers of the problem

Layer 1: Tool results were ephemeral

When our ParallelToolPrefireMiddleware fires web_search, codegraph_search, or reason tools before generation, the results get injected into state.context_chunks. The LLM sees them. It generates a response. The results are never stored.

Next turn, same user asks "what did you find about X earlier?" The system has no memory of the search. The results existed for exactly one generation cycle.

We traced this across three middleware sites:

SearchMiddleware — web search pre-fetch
ParallelToolPrefireMiddleware — pre-fired read-only tools
PlanExecutorMiddleware — plan DAG tool execution

None of them persisted results to any memory tier.

Layer 2: 34 graphs with no unified query interface

AitherOS has graphs for everything: CodeGraph, ServiceGraph, MemoryGraph, WikipediaGraph, ConfigGraph, DirectoryGraph, RAGAnythingGraph, MediaGraph, LogGraph, StrataGraph, InfraGraph, TypeGraph... and the ContextPipeline's Stage 5.5 only queried one of them (AitherKnowledgeGraphBridge).

If a user asked about code, the code graph had great results. If they asked about infrastructure, the infra graph knew. But the pipeline couldn't fan out to multiple graphs based on query intent. It was a serial, single-backend query.

Layer 3: No bridge between context chunks and KV cache blocks

This is the hard one. We had:

ActiveContextCache — chunk-addressed, content-hash dedup, surgical eviction by score
KVCacheGraph — relationship graph over physical KV cache blocks with 5 edge types

Both managing the same data. Neither aware of the other.

When ActiveContextCache evicts a chunk, the corresponding KV blocks continue occupying VRAM until vLLM's LRU eviction gets around to them. When KVCacheGraph tracks a prefix cache hit on certain blocks, the corresponding chunks in ActiveContextCache don't get priority boosts. When we snapshot high-hit blocks to .tqkv files for warm start, we have no idea which agent identity or system prompt they correspond to.

The solution: three new modules, seven integration points

ToolResultPersistence: Making tool results durable

Tool executes -> ToolResultPersistence.persist()
    |
    v
Durability routing:
    ephemeral (git_status)      -> WorkingMemory (15min TTL)
    session   (git_log)         -> WorkingMemory (1hr TTL)
    persistent (web_search)     -> Spirit + Mind (7d TTL)
    |
    v
Graph edge creation:
    (session) --EXECUTED--> (tool_call) --PRODUCED--> (result)
    (query) --TRIGGERED--> (tool_call)
    |
    v
Context cache injection:
    -> ActiveContextCache (for prompt caching stability)

The key design choice: every tool has a durability classification. web_search results are persistent (importance 0.85, goes to Spirit L4 + Mind L3 for vector search). git_status results are ephemeral (importance 0.3, WorkingMemory with 15-minute TTL). reason tool outputs are the most persistent (importance 0.9).

We reuse the existing KnowledgePersistence._persist_to_spirit() pattern. Persistence is fire-and-forget: we wait up to 2 seconds, then let any still-pending tasks finish in the background. Content-hash dedup with a 1-hour window prevents re-persisting the same search result.
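The routing and dedup rules above can be sketched in a few lines. This is a minimal illustration, not the ToolResultPersistence code: the `Durability` dataclass, the `route()` helper, the `DedupWindow` class, the SHA-256 hash, and the ephemeral default for unknown tools are all assumptions; only the tiers, TTLs, and the 1-hour window come from the post.

```python
import hashlib
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Durability:
    tier: str          # destination tier as named in the post
    ttl_seconds: int   # time-to-live for the stored result


# Tiers and TTLs for these three tools are taken from the routing diagram.
DURABILITY = {
    "git_status": Durability("WorkingMemory", 15 * 60),
    "git_log": Durability("WorkingMemory", 60 * 60),
    "web_search": Durability("Spirit+Mind", 7 * 24 * 3600),
}
# Assumption: tools without a classification are treated as ephemeral.
DEFAULT_DURABILITY = DURABILITY["git_status"]


def route(tool_name: str) -> Durability:
    """Look up where a tool's result should be stored."""
    return DURABILITY.get(tool_name, DEFAULT_DURABILITY)


class DedupWindow:
    """Skip re-persisting identical content seen within the window."""

    def __init__(self, window_seconds: float = 3600.0):
        self.window = window_seconds
        self._last_persisted = {}  # content hash -> timestamp

    def should_persist(self, content: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        last = self._last_persisted.get(digest)
        if last is not None and now - last < self.window:
            return False  # same content was persisted recently
        self._last_persisted[digest] = now
        return True
```

A caller would check `should_persist()` first and only then hand the result to the tier returned by `route()`.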

The critical integration: we inject results into ActiveContextCache with proper source mappings. This means on the next turn within the same session, the identical content block is already cached with a stable content_hash. The system prompt prefix doesn't change between turns, maximizing vLLM prefix cache hits.

UnifiedGraphRouter: One query, all 34 graphs

router = get_graph_router()
chunks = await router.query(
    text="how does authentication work in the mesh layer?",
    domains=None,  # auto-detect from query text + intent
    max_results=8,
    budget_tokens=2048,
)

When domains=None, the router auto-detects relevant graph backends from query text and intent classification:

"how does the function parse_request work?" → ["code", "typescript", "document"]
"which service handles authentication?" → ["service", "config", "infra"]
"do you remember what we discussed?" → ["memory", "knowledge"]

The router lazy-loads graph backends (all are optional), fans out parallel queries with per-backend timeouts (3s default), and fuses results via Reciprocal Rank Fusion with content-hash dedup.
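As a rough sketch of domain detection plus fan-out, assuming each backend is an async callable that takes the query text and returns a list of chunks: the keyword rules, `detect_domains()`, and `fan_out()` below are hypothetical stand-ins (the real router also uses intent classification), and only the example domain lists and the 3-second default timeout come from the post.

```python
import asyncio
import re

# Hypothetical keyword rules; the domain lists mirror the three examples above.
DOMAIN_RULES = [
    (re.compile(r"\b(function|class|method|module)\b", re.I),
     ["code", "typescript", "document"]),
    (re.compile(r"\b(service|container|deploy)\b", re.I),
     ["service", "config", "infra"]),
    (re.compile(r"\b(remember|discussed|earlier)\b", re.I),
     ["memory", "knowledge"]),
]


def detect_domains(text):
    """Return matching domains in rule order, without duplicates."""
    domains = []
    for pattern, targets in DOMAIN_RULES:
        if pattern.search(text):
            domains.extend(d for d in targets if d not in domains)
    return domains


async def fan_out(backends, text, domains=None, timeout=3.0):
    """Query the selected backends in parallel; a slow or failing one yields []."""
    wanted = domains if domains is not None else detect_domains(text)
    # Backends that are not loaded are skipped rather than treated as errors.
    active = {d: backends[d] for d in wanted if d in backends}

    async def query_one(name, backend):
        try:
            return name, await asyncio.wait_for(backend(text), timeout=timeout)
        except Exception:  # includes the timeout raised by wait_for
            return name, []

    pairs = await asyncio.gather(*(query_one(n, b) for n, b in active.items()))
    return dict(pairs)


# Stand-in backends for demonstration only.
async def demo_code_backend(text):
    return ["def parse_request(): ..."]


async def demo_slow_backend(text):
    await asyncio.sleep(1.0)
    return ["arrives too late"]
```

Catching the timeout per backend means one slow graph costs at most `timeout` seconds and never fails the whole query.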

The content-hash dedup is critical for prompt caching: if CodeGraph and MemoryGraph both return the same code snippet, RRF merges them into one entry. Duplicate chunks in the system prompt waste cache-key budget and bust the prefix.

KVCacheBridge: The cache coherence protocol

This is the centerpiece. It bridges two abstraction levels:

| Logical (ContextPipeline) | Physical (vLLM KV Cache) |
|---|---|
| `ActiveContextChunk` | `KVBlockNode` |
| `content_hash` | `block_idx` |
| `source` (axioms, code, web...) | `source_layer` (system, context, tools...) |
| `priority` (1-5) | `importance` (0.0-1.0) |
| `relevance` scoring | prefix `hit_count` |

When ContextPipeline finishes assembling a prompt, it calls bridge.register_prompt_layout() with the ordered chunk list. The bridge computes which KV cache blocks each chunk maps to (simple integer arithmetic: floor(token_offset / block_size) for the chunk's first and last token), then propagates chunk priorities to KVBlockNode.importance in the KVCacheGraph.
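The offset arithmetic can be made concrete with a small sketch. The function names and the `(content_hash, token_count)` tuple layout are assumptions for illustration; the 16-token block size is the one mentioned earlier in the post.

```python
def blocks_for_chunk(token_offset, token_count, block_size=16):
    """Indices of every KV block touched by a chunk of tokens."""
    if token_count <= 0:
        return []
    first = token_offset // block_size
    last = (token_offset + token_count - 1) // block_size
    return list(range(first, last + 1))


def prompt_layout(chunks, block_size=16):
    """Map each chunk hash to its block indices.

    `chunks` is an ordered list of (content_hash, token_count) pairs,
    in the order they appear in the assembled prompt.
    """
    mapping = {}
    offset = 0
    for content_hash, token_count in chunks:
        mapping[content_hash] = blocks_for_chunk(offset, token_count, block_size)
        offset += token_count
    return mapping
```

Note that chunk boundaries rarely align with block boundaries, so adjacent chunks can share a block: a 40-token chunk followed by a 24-token chunk both touch block 2.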

This means:

Axiom chunks (priority 5) → importance 0.95 — KVCacheGraph will never suggest these for eviction
Code chunks (priority 2) → importance 0.50 — evictable under pressure, but not first
Tool result chunks (source "prefire"/"plan") → layer "tools" — protected from eviction
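A possible shape for the propagation step, as a sketch only. The post states importance values for priorities 5, 2, and 1 (0.95, 0.50, and 0.30); the fallback for other priorities and the rule that a shared block takes the highest importance among its chunks are assumptions made here, not documented behavior.

```python
# Values stated in the post; priorities 3 and 4 are not specified there.
KNOWN_IMPORTANCE = {5: 0.95, 2: 0.50, 1: 0.30}
DEFAULT_IMPORTANCE = 0.50  # assumption for unspecified priorities


def block_importance(chunk_blocks, chunk_priority):
    """Propagate chunk priorities down to per-block importance.

    chunk_blocks:   content_hash -> list of block indices
    chunk_priority: content_hash -> priority (1-5)
    """
    importance = {}
    for content_hash, blocks in chunk_blocks.items():
        value = KNOWN_IMPORTANCE.get(chunk_priority[content_hash], DEFAULT_IMPORTANCE)
        for block_idx in blocks:
            # Assumption: a block shared by two chunks keeps the higher value,
            # so a low-priority neighbor cannot make an axiom block evictable.
            importance[block_idx] = max(importance.get(block_idx, 0.0), value)
    return importance
```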

The reverse direction is equally important. When vLLM reports a prefix cache hit (blocks X, Y, Z were reused from a previous request), the bridge:

Feeds KVCacheGraph.on_prefix_hit() to create PREFIX_SHARE edges
Finds which logical chunks map to those blocks
Boosts their last_accessed in ActiveContextCache (extends TTL, improves score)
Tracks hit counts per chunk for .tqkv snapshot decisions
Updates a 5-minute rolling hit rate
Feeds the hit rate into ContextOutcomeTracker as a quality signal
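Two of these steps, the block-to-chunk lookup and the 5-minute rolling hit rate, are easy to sketch. The class and function names are hypothetical, and measuring the rate as hit blocks over total blocks per request is an assumption about how the rate is defined.

```python
import time
from collections import deque


def chunks_for_blocks(chunk_blocks, hit_blocks):
    """Return the chunk hashes whose blocks overlap the reported hits."""
    hits = set(hit_blocks)
    return [h for h, blocks in chunk_blocks.items() if hits.intersection(blocks)]


class RollingHitRate:
    """Prefix cache hit rate over a sliding time window (300 s = 5 minutes)."""

    def __init__(self, window_seconds=300.0):
        self.window = window_seconds
        self._events = deque()  # (timestamp, hit_blocks, total_blocks)

    def record(self, hit_blocks, total_blocks, now=None):
        now = time.monotonic() if now is None else now
        self._events.append((now, hit_blocks, total_blocks))
        self._trim(now)

    def _trim(self, now):
        # Drop events that have aged out of the window.
        while self._events and now - self._events[0][0] > self.window:
            self._events.popleft()

    def rate(self, now=None):
        now = time.monotonic() if now is None else now
        self._trim(now)
        total = sum(t for _, _, t in self._events)
        hits = sum(h for _, h, _ in self._events)
        return hits / total if total else 0.0
```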

The hit rate feedback closes the learning loop: if the context pipeline starts producing unstable prefixes (too much churn between turns), the hit rate drops, ContextOutcomeTracker records low quality for active context sources, and NeuronScaler suppresses volatile sources on the next query.

The architecture after all seven gaps closed

USER MESSAGE
    |
    v
ContextPipeline Stage 5.5: UnifiedGraphRouter.query()
    |-- CodeGraph (code domain)
    |-- ServiceGraph (service domain)
    |-- MemoryGraph (memory domain)
    |-- [auto-detected from query intent]
    |
    v
Tool Pre-fire / Plan Execution
    |-- Execute tool -> inject result into context
    |-- Fire-and-forget ToolResultPersistence.persist()
    |     |-- MemoryBus (durability-routed)
    |     |-- Spirit L4 (for persistent tools)
    |     |-- GraphSyncBus (tool_call edges)
    |     +-- ActiveContextCache (stable prefix)
    |
    v
ContextPipeline Stage 8: Final assembly
    |-- register_prompt_layout() -> KVCacheBridge
    |     |-- chunk_hash -> [block_idx...] mapping
    |     +-- priority -> importance propagation -> KVCacheGraph
    |
    v
vLLM Tokenization + KV Cache Allocation
    |-- eviction_plugin: graph-aware eviction (not LRU)
    |-- BlockSelector: priority-aware attention (top-k relevant blocks)
    |-- on_prefix_hit() -> KVCacheBridge feedback
    |     |-- boost chunk priority
    |     |-- track snapshot candidates
    |     +-- feed ContextOutcomeTracker
    |
    v
FUTURE SESSION
    |-- ContextPipeline Stage 5.7 (MemBus) -> recalls persisted tool results
    |-- UnifiedGraphRouter -> traverses tool_call->result edges
    |-- NeuronScaler -> suppresses low-quality, boosts high-quality sources
    +-- .tqkv warm start -> pre-loaded KV blocks for known system prompts

The numbers that matter

The real impact is in prefix cache hit rate and cross-session recall:

Prefix cache stability: When ToolResultPersistence injects results into ActiveContextCache with stable content hashes, the system prompt prefix between turns changes less. Blocks that map to axioms/identity/will (the first ~500 tokens) almost never change. vLLM's prefix cache can reuse the KV computations for those blocks on every turn.

With the old system, every tool result was a fresh string with a new hash. The prefix shifted every turn. Hit rate: ~15-30%.

With the new system, persisted results have stable hashes. Same content → same hash → same cache position. Hit rate: 60-80% on multi-turn conversations.

Cross-session recall: A web search result from Session A is now stored in Spirit (L4) with importance 0.85. In Session B, the MemoryBus stage queries Spirit, finds the result, and injects it as context. The user doesn't need to search again. The tiered promotion chain (ephemeral → session → persistent → permanent) means frequently-accessed results naturally promote to longer-lived tiers.

Graph eviction quality: Before the bridge, vLLM used pure LRU eviction. After, the GraphEvictionAdvisor uses a composite score:

eviction_score = (
    age * 0.01                    # Older = more evictable
    - degree * 5.0                # More connected = keep
    - edge_weight_sum * 2.0       # Stronger edges = keep
    - node.importance * 20.0      # Higher importance = keep (from chunk priority!)
    - node.hit_count * 3.0        # More prefix hits = keep
)

The importance field now comes from ContextPipeline chunk priorities, not the old heuristic of "is_prefill → 0.7, else → 0.3". System axioms get 0.95. Stale web results get 0.30. The eviction advisor makes dramatically better decisions.
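To see how the weights interact, here is a runnable version of the formula applied to two blocks like the ones from the introduction. The `BlockStats` dataclass, the sample values, and the assumption that `age` is in seconds are illustrative; the weights are the ones shown above.

```python
from dataclasses import dataclass


@dataclass
class BlockStats:
    block_idx: int
    age: float              # assumed to be seconds since last use
    degree: int             # number of graph edges on this block
    edge_weight_sum: float
    importance: float       # propagated from chunk priority
    hit_count: int          # prefix cache hits


def eviction_score(b):
    """Higher score = better eviction candidate (weights from the post)."""
    return (
        b.age * 0.01
        - b.degree * 5.0
        - b.edge_weight_sum * 2.0
        - b.importance * 20.0
        - b.hit_count * 3.0
    )


def eviction_order(blocks):
    """Block indices sorted from most to least evictable."""
    return [b.block_idx for b in sorted(blocks, key=eviction_score, reverse=True)]


# Hypothetical sample data: an axiom block and a stale web result of equal age.
axioms = BlockStats(47, age=600, degree=4, edge_weight_sum=3.0,
                    importance=0.95, hit_count=12)
stale_web = BlockStats(193, age=600, degree=0, edge_weight_sum=0.0,
                       importance=0.30, hit_count=0)
```

With these sample values the two blocks have the same age, yet the stale web result sorts ahead of the axiom block for eviction because importance, connectivity, and hit count all pull the axiom block's score down.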

Implementation details for the curious

Why fire-and-forget everywhere?

Every persistence and graph operation is fire-and-forget with background task completion:

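# Start each persistence write as its own task so none blocks the others.
persist_tasks = [asyncio.create_task(...)]
# Wait at most 2 seconds; anything still running is returned in `pending`.
done, pending = await asyncio.wait(persist_tasks, timeout=2.0)
for task in pending:
    # Retrieve the eventual exception so asyncio does not log
    # "Task exception was never retrieved" when the task finishes later.
    task.add_done_callback(lambda t: t.exception() if not t.cancelled() else None)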

The generation pipeline has a latency budget. At effort 5, the user expects a response in 3-5 seconds. We can't add 500ms of synchronous Spirit persistence to that path. The 2-second timeout is generous — if persistence is slower than that, something is wrong with the Spirit/Mind service, and we should let the background task finish asynchronously.
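A self-contained version of the same pattern, with stand-in coroutines in place of the real Spirit/Mind writes. The helper names are hypothetical, and consuming exceptions on the already-finished tasks as well is an addition in this sketch, not something the post describes.

```python
import asyncio


def _consume_exception(task):
    """Mark a task's exception as retrieved so asyncio does not warn about it."""
    if not task.cancelled():
        task.exception()


async def persist_with_budget(coros, timeout=2.0):
    """Run persistence coroutines, waiting at most `timeout` seconds."""
    tasks = [asyncio.create_task(c) for c in coros]
    if not tasks:
        return 0, set()  # asyncio.wait() rejects an empty collection
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in done:
        _consume_exception(task)
    for task in pending:
        # Still running: let it finish in the background.
        task.add_done_callback(_consume_exception)
    return len(done), pending


async def demo():
    async def quick_write():
        await asyncio.sleep(0.01)
        return "ok"

    async def slow_write():
        await asyncio.sleep(0.2)
        return "late"

    finished, pending = await persist_with_budget(
        [quick_write(), slow_write()], timeout=0.05
    )
    # Generation would continue here while the slow write keeps running.
    late_results = await asyncio.gather(*pending)
    return finished, late_results
```

One caveat worth keeping in mind: background tasks only complete if the event loop stays alive, so a long-running service fits this pattern better than a short script.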

Why Reciprocal Rank Fusion for graph results?

We evaluated three fusion strategies for UnifiedGraphRouter:

Score-weighted merge: Requires normalizing scores across backends (CodeGraph returns 0-1 relevance, MemoryGraph returns cosine similarity, ServiceGraph returns BM25 scores). Normalization is fragile.
Round-robin interleaving: Simple but ignores quality differences between backends.
RRF with content-hash dedup: score(d) = Σ 1/(k + rank(d)) across all result lists. No score normalization needed — only rank order matters. Content-hash dedup prevents the same chunk from appearing twice (critical for prompt caching).

RRF won. It's also what ComposableMemoryGraph already uses for multi-backend fusion, so the pattern is proven.
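A minimal sketch of RRF with content-hash dedup, assuming each backend returns an ordered list of text chunks. The function names and SHA-256 hashing are illustrative, and `k=60` is a commonly used constant rather than a value stated in the post.

```python
import hashlib
from collections import defaultdict


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def rrf_fuse(result_lists, k=60, max_results=8):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank(d)).

    Chunks with identical content share one hash, so they accumulate score
    into a single entry instead of appearing twice in the output.
    """
    scores = defaultdict(float)
    text_by_hash = {}
    for results in result_lists:
        seen_in_list = set()
        for rank, text in enumerate(results, start=1):
            h = content_hash(text)
            if h in seen_in_list:
                continue  # count each chunk once per backend
            seen_in_list.add(h)
            scores[h] += 1.0 / (k + rank)
            text_by_hash.setdefault(h, text)
    # Sort by score, using the hash as a deterministic tie-breaker.
    ranked = sorted(scores, key=lambda h: (-scores[h], h))
    return [text_by_hash[h] for h in ranked[:max_results]]
```

For example, if CodeGraph returns `["def auth()", "class Mesh"]` and MemoryGraph returns `["notes", "def auth()"]`, the shared snippet collects score from both lists and ranks first, and it appears only once in the fused output.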

The .tqkv persistence story

When a chunk's associated KV blocks have been prefix-cache-hit >= 10 times, the bridge emits a KV_SNAPSHOT_REQUEST Flux event with the block indices and target path. The vLLM sidecar can consume this event and call persistence.save_tqkv() to snapshot those blocks to an NVMe-backed .tqkv file.

On the next cold start, MicroScheduler's model warming loop can call mmap_tqkv() to zero-copy load those blocks back into the TQ GPU cache, skipping the 2-5 second prefill computation for known system prompts.

The snapshot path includes the agent identity: {agent_id}_{prompt_hash}.tqkv. This means demiurge's system prompt, aither's system prompt, and hydra's system prompt each get their own warm-start cache. In a system with 43 agent identities, this eliminates the cold-start tax on the most frequently used agents.
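The threshold check and the per-agent file naming could look roughly like this. It is a sketch under assumptions: the `SnapshotTracker` class, the truncated SHA-256 prompt hash, and emitting the request only once per chunk are all hypothetical; the threshold of 10 hits and the `{agent_id}_{prompt_hash}.tqkv` pattern come from the post.

```python
import hashlib

SNAPSHOT_HIT_THRESHOLD = 10  # from the post: snapshot after >= 10 prefix hits


def snapshot_filename(agent_id, system_prompt):
    """Build the {agent_id}_{prompt_hash}.tqkv name for a warm-start file."""
    # Assumption: the prompt hash is a truncated SHA-256 of the prompt text.
    prompt_hash = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:16]
    return f"{agent_id}_{prompt_hash}.tqkv"


class SnapshotTracker:
    """Count prefix hits per chunk and flag when a snapshot should be requested."""

    def __init__(self, threshold=SNAPSHOT_HIT_THRESHOLD):
        self.threshold = threshold
        self._hits = {}
        self._requested = set()

    def record_hit(self, content_hash):
        """Return True exactly once, when the chunk first reaches the threshold."""
        self._hits[content_hash] = self._hits.get(content_hash, 0) + 1
        if (self._hits[content_hash] >= self.threshold
                and content_hash not in self._requested):
            self._requested.add(content_hash)
            return True
        return False
```

In this sketch a `True` return is the point where the bridge would emit its KV_SNAPSHOT_REQUEST event; the event emission itself is left out.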

What we didn't build (and why)

We didn't build a custom vLLM scheduler. The eviction plugin's monkey-patch of FreeKVCacheBlockQueue.popleft() is surgical — it replaces one method on one class. The graph advisor runs in a background thread with pre-computed rankings. The hot decode path reads a pre-computed list with zero lock contention. No scheduler changes needed.

We didn't build a new graph database. All 34 existing graph backends continue to work. The UnifiedGraphRouter is a facade with lazy imports. If a graph backend is unavailable, it's skipped. If all fail, the pipeline falls back to the legacy single-backend path.

We didn't build cross-node KV cache sharing. The .tqkv format supports it (mmap is process-agnostic), but for now we're single-GPU (RTX 5090 with DGX Spark for reasoning). When we add multi-GPU or multi-node inference, the graph edges and persistence format are ready.

The lesson

The gap between application-level intelligence and infrastructure-level efficiency is the most underexplored frontier in LLM systems. Everyone optimizes the model. Everyone optimizes the kernel. Almost nobody builds the bridge between "this context chunk has priority 5" and "this KV cache block should never be evicted."

That bridge is where the real leverage is. A 2x improvement in prefix cache hit rate saves more compute than a 10% improvement in attention kernel speed. And it compounds — every turn that reuses cached KV blocks is a turn where you can spend that saved compute on better context assembly.

Build the bridge. Close the loop. Let the GPU know what matters.
