Your GPU Doesn't Know What It's Forgetting — Ours Does
There is a specific failure mode in long-context LLM inference that nobody talks about because it is invisible. Your model is generating a response. It runs out of VRAM for KV cache blocks. The inference engine evicts some blocks to make room. The blocks it evicts happen to contain your system identity — the axioms, rules, and personality that define who the model is pretending to be.
The model does not crash. It does not throw an error. It simply stops being itself mid-sentence. The persona dissolves. The safety constraints vanish. The carefully constructed context pipeline you spent months building gets silently eaten by a FIFO queue that cannot tell the difference between a core identity axiom and the third paragraph of a user's copy-pasted email.
This is the default behavior of every major inference engine. vLLM, TGI, llama.cpp — they all use variants of LRU or FIFO for KV cache block eviction. The eviction policy operates at the memory management layer. It has no concept of what the blocks contain.
We fixed this.
The Architecture of Forgetting
To understand why this matters, you need to understand how AitherOS builds context.
Every LLM call goes through a 12-stage ContextPipeline. It assembles a system prompt in layers:
[AXIOMS] → [IDENTITY] → [RULES] → [CAPABILITIES] → [CONTEXT] → [MEMORIES] → [AFFECT]
Each layer has different importance. Axioms are non-negotiable. Identity defines the agent's persona. Rules enforce safety. Memories provide conversational continuity. Generation tokens are the model's output — ephemeral by nature.
When this prompt hits vLLM, the tokenized content gets stored in KV cache blocks. Each block is 16 tokens of key-value attention state. A 4096-token system prompt occupies 256 blocks. vLLM sees them as undifferentiated integers in a block table. Block 47 could be your safety constraints or it could be the word "the" in a throwaway sentence. The block allocator does not know and does not care.
Under VRAM pressure — which happens constantly when you are running 329K-token TQ35 contexts on a single GPU — blocks get evicted. The standard policy is: evict the least recently used. If your system prompt was loaded 30 seconds ago and the user has been generating for 20 seconds, guess which blocks are "least recently used."
Your identity.
KVCacheGraph: A Relationship Graph Over Physical Memory
The solution required building a new data structure: a graph over KV cache blocks that encodes what they mean and how they relate to each other.
KVCacheGraph is a faculty graph — it follows the same BaseFacultyGraph pattern as our CodeGraph, MemoryGraph, and 13 other domain-specific graphs. But instead of indexing code or memories, it indexes physical KV cache blocks.
Every block becomes a node with metadata:
```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class KVBlockNode:
    block_idx: int
    source_layer: str     # axioms, identity, memories, user_turn, generation
    importance: float     # 0-1 from context budget scoring
    token_range: Tuple[int, int]
    tenant_slug: str
    last_attended: float
    hit_count: int        # Prefix cache hits
    is_spilled: bool      # Currently in DDR5 cold tier
    embedding: Optional[List[float]]  # Block representative vector
```
The source_layer field is the critical piece. It comes from the ContextPipeline — when the pipeline assembles the system prompt, it tags each section with its layer name. The TierCacheBridge propagates these tags down to the block metadata table, which registers them in the graph.
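The propagation step can be sketched roughly like this; `tag_blocks` and the span format are illustrative stand-ins, not the actual TierCacheBridge API:

```python
BLOCK_SIZE = 16  # tokens per KV cache block, as in vLLM's default

def tag_blocks(layer_spans, block_size=BLOCK_SIZE):
    """Map (layer_name, start_token, end_token) spans to per-block layer tags.

    Returns {block_idx: layer_name}. A block is tagged with the layer that
    owns its first token, mirroring how section tags would propagate from
    the assembled prompt down to the block metadata table.
    """
    tags = {}
    for layer, start, end in layer_spans:
        first_block = start // block_size
        last_block = (end - 1) // block_size
        for idx in range(first_block, last_block + 1):
            tags.setdefault(idx, layer)  # first writer wins at block boundaries
    return tags

# A 64-token axioms section followed by a 32-token identity section:
spans = [("axioms", 0, 64), ("identity", 64, 96)]
tags = tag_blocks(spans)
# blocks 0-3 tagged axioms, blocks 4-5 tagged identity
```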
But metadata alone is not enough. The graph also tracks six types of edges between blocks:
PREFIX_SHARE — Blocks reused across requests via vLLM's prefix caching. If block 12 appears in both request A and request B's prefix, that block is structurally important. Evicting it would force recomputation for both requests.
CO_ATTEND — Blocks that frequently appear together in the same attention step. After 3 co-occurrences, they get an edge; each further co-occurrence increases the edge weight. This captures relationships that the model's own attention behavior is revealing.
SEMANTIC — Blocks with similar key vector embeddings (cosine similarity > 0.8). When a block is added with an embedding, we scan its neighbors and create semantic edges. This clusters blocks by content similarity, so evicting one block from a semantic cluster does not strand the others.
TEMPORAL — Consecutive blocks from the same generation sequence. These represent sequential reasoning — evicting a block from the middle of a chain breaks the model's train of thought.
SPILL_LINK — Links between hot VRAM blocks and their cold DDR5 copies. This is how the TierCacheBridge tracks what has been spilled and what can be warmed back.
MEMORY_REF — Links to MemoryGraph nodes. When a KV cache block contains content that came from episodic memory, this edge connects the physical block to the semantic memory it represents.
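The co-attention and semantic rules above can be sketched as follows. The 3-co-occurrence threshold and the 0.8 cosine cutoff come from the text; the `EdgeTracker` container and its weight scheme are illustrative, not the real graph implementation:

```python
import math
from collections import defaultdict

CO_ATTEND_THRESHOLD = 3   # edge after 3 co-occurrences (from the text)
SEMANTIC_CUTOFF = 0.8     # cosine similarity threshold (from the text)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class EdgeTracker:
    def __init__(self):
        self.co_counts = defaultdict(int)  # (block_a, block_b) -> co-occurrences
        self.edges = {}                    # (block_a, block_b) -> edge weight

    def observe_attention_step(self, active_blocks):
        """Count pairwise co-occurrence; promote to an edge at the threshold."""
        blocks = sorted(set(active_blocks))
        for i, a in enumerate(blocks):
            for b in blocks[i + 1:]:
                self.co_counts[(a, b)] += 1
                if self.co_counts[(a, b)] >= CO_ATTEND_THRESHOLD:
                    # weight keeps growing with further co-occurrences
                    self.edges[(a, b)] = self.co_counts[(a, b)] - CO_ATTEND_THRESHOLD + 1

    def maybe_semantic_edge(self, a, b, emb_a, emb_b):
        """Create a SEMANTIC-style edge when key embeddings are similar enough."""
        if cosine(emb_a, emb_b) > SEMANTIC_CUTOFF:
            self.edges.setdefault((min(a, b), max(a, b)), 1.0)
```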
The Eviction Scoring Formula
With this graph, eviction becomes an optimization problem over a scored graph rather than a blind FIFO queue.
The suggest_eviction() method computes a composite score for every non-protected, non-spilled block:
```python
score = (
    age * 0.01                # Older = more evictable
    - degree * 5.0            # More connected = keep
    - edge_weight_sum * 2.0   # Stronger edges = keep
    - node.importance * 20.0  # Higher importance = keep
    - node.hit_count * 3.0    # More prefix hits = keep
)
```
Higher scores get evicted first. The weights encode a clear priority:
- Source layer protection — Blocks from axioms, identity, and rules are excluded entirely. They cannot be evicted regardless of score. The model's core identity is structurally protected.
- Importance from context scoring — The ContextPipeline already computes importance scores for every piece of content. A memory that was recalled 5 times has higher importance than one recalled once. This score propagates to the block level.
- Graph connectivity — A block with 8 CO_ATTEND edges and 3 SEMANTIC edges is embedded in a web of relationships. Evicting it would degrade the attention patterns the model has built up. A lonely block with zero edges is an isolated fact — safe to evict.
- Prefix cache value — Blocks that have been reused across multiple requests (high hit_count) represent shared computation. Evicting them wastes the work of every previous request that benefited from prefix caching.
- Recency — All else being equal, older blocks are more evictable. This preserves the useful aspect of LRU without making it the only signal.
The result is that generation tokens with low connectivity get evicted first. System prompt blocks stay. Memory blocks with high importance stay. Prefix-shared blocks stay. The model keeps its identity.
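Putting the layer protection and the scoring formula together, a minimal suggest_eviction() might look like this. This is a sketch: the simplified `Node` container and the `degree`/`edge_weight_sum` lookups are stand-ins for the real graph structures.

```python
import time
from dataclasses import dataclass

PROTECTED_LAYERS = {"axioms", "identity", "rules"}  # never eviction candidates

@dataclass
class Node:  # simplified stand-in for KVBlockNode
    block_idx: int
    source_layer: str
    importance: float
    hit_count: int
    last_attended: float
    is_spilled: bool = False

def suggest_eviction(nodes, degree, edge_weight_sum, n_blocks, now=None):
    """Rank non-protected, non-spilled blocks by the composite score.

    Higher score = evicted sooner. Returns up to n_blocks block indices.
    """
    now = time.time() if now is None else now
    scored = []
    for node in nodes:
        if node.source_layer in PROTECTED_LAYERS or node.is_spilled:
            continue  # structural protection: identity blocks never ranked
        age = now - node.last_attended
        score = (
            age * 0.01
            - degree.get(node.block_idx, 0) * 5.0
            - edge_weight_sum.get(node.block_idx, 0.0) * 2.0
            - node.importance * 20.0
            - node.hit_count * 3.0
        )
        scored.append((score, node.block_idx))
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:n_blocks]]
```

A lonely, stale generation block outranks a well-connected memory block, and axiom blocks never appear in the ranking at all.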
GraphEvictionAdvisor: Zero-Latency Hot Path
Computing eviction rankings over a graph is not free. The suggest_eviction() method iterates over all nodes, computes edge weight sums, and sorts. On a graph with 10,000 blocks, this takes milliseconds. Milliseconds are unacceptable on the decode path — at 40 tok/s, each decode step has 25ms total budget, and the attention forward pass consumes most of it.
The GraphEvictionAdvisor solves this with a background thread pattern:
```python
class GraphEvictionAdvisor:
    def __init__(self, interval=0.5, max_stale=2.0):
        self._eviction_ranking: Optional[List[int]] = None
        self._ranking_ts: float = 0.0
```
A daemon thread recomputes the eviction ranking every 500ms. The result is stored as an atomic reference. The hot decode path reads this reference — no lock, no mutex, no blocking. If the ranking is older than 2 seconds (max_stale), the caller gets None and falls back to FIFO. This is a graceful degradation, not a failure.
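Here is a minimal, self-contained sketch of that pattern. It relies on the fact that a Python attribute assignment swaps a single reference atomically, which is what makes the lock-free hot-path read safe; the class and callable names here are illustrative:

```python
import threading
import time

class BackgroundRankingAdvisor:
    """Recompute a ranking on a daemon thread; the hot path only reads a reference."""

    def __init__(self, compute_ranking, interval=0.5, max_stale=2.0):
        self._compute = compute_ranking  # expensive callable returning a list
        self._interval = interval
        self._max_stale = max_stale
        self._ranking = None             # atomic reference, swapped whole
        self._ranking_ts = 0.0
        thread = threading.Thread(target=self._loop, daemon=True)
        thread.start()

    def _loop(self):
        while True:
            ranking = self._compute()    # the slow graph traversal
            self._ranking = ranking      # single reference swap, no lock
            self._ranking_ts = time.time()
            time.sleep(self._interval)

    def get_ranking(self):
        """Hot-path read: a pointer load plus a timestamp compare, never blocks.

        Returns None when the ranking is stale so the caller can fall
        back to FIFO instead of waiting.
        """
        if time.time() - self._ranking_ts > self._max_stale:
            return None
        return self._ranking
```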
The decode path integration in vllm_hooks.py is minimal:
```python
# Every 10 decode steps, ask advisor for prefetch candidates
_decode_counter += 1
if _decode_counter % 10 == 0:
    advisor = get_graph_eviction_advisor()
    if advisor is not None:
        prefetch = advisor.get_prefetch_candidates(active, 8)
        if prefetch:
            _async_warm_blocks(prefetch)
```
Every 10 decode steps (roughly every 250ms at 40 tok/s), the advisor checks which cold-tier blocks are graph-neighbors of the currently active blocks and triggers an async warm. The blocks are moved from DDR5 to VRAM before they are needed.
This is predictive prefetch. The graph tells us what the model is likely to attend to next based on co-attention patterns, temporal sequences, and semantic similarity. Instead of waiting for a cache miss and stalling the decode loop, we preemptively warm the blocks.
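A plausible shape for the candidate selection, with adjacency lists and spill flags standing in for the graph (the real get_prefetch_candidates may weight edge types differently):

```python
def get_prefetch_candidates(active_blocks, adjacency, is_spilled, limit):
    """Return up to `limit` cold-tier blocks that are graph neighbors of the
    currently active blocks, strongest aggregate edge weight first.

    adjacency: {block_idx: [(neighbor_idx, edge_weight), ...]}
    is_spilled: {block_idx: bool} -- only spilled blocks are worth warming
    """
    votes = {}
    for block in active_blocks:
        for neighbor, weight in adjacency.get(block, []):
            if is_spilled.get(neighbor, False):
                votes[neighbor] = votes.get(neighbor, 0.0) + weight
    # Blocks linked to many active blocks (or linked strongly) rank first
    ranked = sorted(votes, key=votes.get, reverse=True)
    return ranked[:limit]
```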
The TierCacheBridge Integration
The GraphEvictionAdvisor does not operate in isolation. It connects to the ContextTierManager through the TierCacheBridge — the same bridge that already handles hot/cold tier transitions.
When the ContextTierManager promotes content (it was recalled and should be in VRAM), the bridge calls cache.warm_blocks() and notifies the graph via on_warm(). When content is demoted (it has not been accessed in a while), the bridge spills the blocks to DDR5 and notifies via on_spill().
The graph tracks these state transitions. When a block is spilled, is_spilled = True. The prefetch algorithm only suggests spilled blocks — it would be pointless to suggest blocks already in VRAM. When a block is warmed back, the flag clears.
The bridge also exposes select_spill_candidates() — when the GPU cache needs to free blocks, it asks the graph first:
```python
def select_spill_candidates(self, n_blocks: int) -> List[int]:
    graph = self._get_kv_graph()
    if graph is not None:
        candidates = graph.suggest_eviction(n_blocks)
        if candidates:
            return candidates
    return []  # Caller falls back to default selection
```
If the graph is available, spill decisions are intelligent. If not, the system degrades to standard LRU. This opt-in pattern (AITHER_KVCACHE_GRAPH=1) means the entire system is zero-risk to enable.
What Changes in Practice
The concrete impact:
Identity stability in long sessions. Before GraphEvictionAdvisor, we observed persona drift in conversations that exceeded 80% VRAM utilization. The model would subtly lose personality traits, forget safety constraints, or start responding in a different tone. The axiom and identity blocks were being evicted under pressure. Now they are structurally protected — they cannot be eviction candidates regardless of VRAM pressure.
Faster cold-tier recovery. When a user references something from earlier in a conversation, the relevant memory blocks need to be warmed from DDR5 back to VRAM. Without prefetch, this is a synchronous operation that stalls generation. With the advisor's predictive prefetch, graph-neighbor blocks are already warming by the time the model needs them. The 10-step prefetch check means blocks start warming ~250ms before they are required.
Better prefix cache utilization. Blocks with high prefix cache hit counts are protected from eviction. In multi-user scenarios where many requests share a common system prompt prefix, this means the shared prefix stays in VRAM instead of being repeatedly evicted and recomputed.
Semantic coherence under pressure. The semantic edge type means blocks with similar content cluster together in the graph. When eviction does happen, it removes the least-connected subgraph — typically isolated generation tokens from completed responses. The semantically coherent blocks (related memories, connected context) remain as a unit.
The Numbers
The advisor background thread consumes ~0.1% CPU (one graph traversal every 500ms). Memory overhead is proportional to the number of blocks — roughly 200 bytes per node plus 64 bytes per edge. For a 329K-token context with ~20,000 blocks, that is approximately 8MB. On a 32GB VRAM GPU, this is noise.
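For the record, the arithmetic behind the ~8MB figure, using the per-node and per-edge costs above; the edge count here is an illustrative assumption chosen to match the stated total:

```python
BLOCKS = 20_000      # ~329K tokens / 16 tokens per block (from the text)
NODE_BYTES = 200     # per-node metadata cost (from the text)
EDGE_BYTES = 64      # per-edge cost (from the text)
EDGES = 65_000       # assumption: a few edges per node on average

node_mb = BLOCKS * NODE_BYTES / 1e6   # 4.0 MB of node metadata
edge_mb = EDGES * EDGE_BYTES / 1e6    # ~4.2 MB of edges
total_mb = node_mb + edge_mb          # ~8 MB, matching the figure above
```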
The decode path overhead is zero — a pointer read and a timestamp comparison. No lock contention, no graph query, no blocking.
The prefetch accuracy depends on the graph's edge density. After the first few hundred decode steps, the co-attention tracking builds enough edges to make meaningful predictions. In our benchmarks, graph-informed prefetch reduces cold-tier stalls by roughly 40% compared to reactive warming.
Why This Matters Beyond AitherOS
The pattern — building a semantic graph over physical memory blocks — is general. Any system that manages KV cache blocks could benefit from knowing what those blocks contain. The insight is not complicated: if your eviction policy does not understand your content hierarchy, it will eventually evict the content that matters most.
Most inference systems treat KV cache management as a solved problem. It is not. It is a data structure problem that has been addressed with a memory management solution. FIFO works when all blocks have equal value. In any system with a structured context pipeline — system prompts, tool definitions, conversation history, injected memories — blocks have wildly unequal value. The eviction policy should reflect that.
The GraphEvictionAdvisor is opt-in (AITHER_KVCACHE_GRAPH=1), has zero decode-path overhead, degrades gracefully when disabled, and syncs with the rest of the AitherOS faculty graph infrastructure via BaseFacultyGraph. It turns VRAM management from a blind physical operation into a context-aware semantic one.
Your GPU should know what it is forgetting. Now ours does.