3-Tier KV Cache: When Your GPU Memory Becomes a Memory System
Earlier today we shipped TurboQuant -- a sub-byte KV cache quantizer that compresses attention state to 34 KB per token using 4-bit vector quantization, fused Triton kernels that read compressed data directly, and a two-tier async cache with VRAM and DDR5 tiers. That post was about the algorithm and the deployment. This post is about what happened next: the moment we realized that having 548,000 tokens of VRAM capacity and 3.9 million tokens of DDR5 capacity changed the entire architecture above the GPU.
The problem is not "how do we fit more tokens in memory." The problem is: when memory is no longer scarce, what do you do with all that space?
The Scarcity Architecture
AitherOS has a 10-layer system prompt built by build_system_message(). Every LLM call assembles it from scratch:
| Layer | Name | Budget |
|---|---|---|
| 0 | Axioms | ~200 tokens |
| 1 | Identity | ~300 tokens |
| 2 | System Knowledge | ~600 tokens |
| 3 | Capabilities | ~1500 tokens |
| 4 | System State | ~800 tokens |
| 5 | Current Time | ~80 tokens |
| 6 | Neuron Context | ~1600 tokens |
| 7 | Memories | ~1100 tokens |
| 8 | Affect | ~250 tokens |
| 9 | Awareness | ~400 tokens |
Total budget: roughly 6,000 tokens. Every layer goes through _synthesize_for_budget(), an extractive compressor that scores lines by importance, keeps the highest-scoring ones until the budget fills, and spills the rest to a 5-tier memory cascade (KernelContextBus -> WorkingMemory -> Spirit -> Graph). The system is built around the assumption that prompt tokens are expensive and KV cache is a fixed, single-tier resource.
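The behavior described above (score lines, keep the best until the budget fills, spill the rest) can be sketched roughly as follows. This is not the actual _synthesize_for_budget(); the function name, the whitespace token counter, and the scoring rule are all assumptions made for illustration.

```python
from typing import Callable


def synthesize_for_budget_sketch(
    lines: list[str],
    budget_tokens: int,
    score: Callable[[str], float],
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> tuple[list[str], list[str]]:
    """Return (kept, spilled): highest-scoring lines that fit, and the rest."""
    # Visit lines from highest to lowest score; sorted() is stable, so ties
    # keep their original relative order.
    ranked = sorted(range(len(lines)), key=lambda i: score(lines[i]), reverse=True)
    kept_idx: set[int] = set()
    used = 0
    for i in ranked:
        cost = count_tokens(lines[i])
        if used + cost <= budget_tokens:  # a line that does not fit is skipped
            kept_idx.add(i)
            used += cost
    # Emit both lists in the original document order.
    kept = [ln for i, ln in enumerate(lines) if i in kept_idx]
    spilled = [ln for i, ln in enumerate(lines) if i not in kept_idx]
    return kept, spilled


def demo_score(line: str) -> float:
    # Hypothetical scoring rule, for illustration only.
    if line.startswith("CRITICAL"):
        return 1.0
    return 0.5 if line.startswith("tool:") else 0.1


demo_lines = [
    "CRITICAL: never delete user data",  # 5 whitespace-separated tokens
    "the weather is nice",               # 4
    "tool: search(query)",               # 2
]
kept, spilled = synthesize_for_budget_sketch(demo_lines, 7, demo_score)
```

With a 7-token budget the critical line and the tool line are kept and the low-scoring line is spilled. In the real system the spilled list would presumably feed the memory cascade.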
That assumption was correct when FP16 cost 128 KB per token and FP8 cost 64 KB. You could not afford to send 50K tokens of context to the model even if you had 50K tokens of useful context. You had to synthesize.
TurboQuant changes the math.
The Numbers
With TQ4 compression at 34 KB per token:
- VRAM tier (20 GB allocated): 548,000 tokens of active KV cache. The fused Triton kernel reads packed uint8 indices + float32 norms directly -- no decompression buffer, no FP8 intermediary.
- DDR5 tier (128 GB system RAM): 3.9 million tokens in pinned memory. Same TQ4 format as VRAM. Warming is a bulk memcpy -- 4 tensor copies across all 32 layers at once, because the 5D contiguous layout ([layer, block, pos, head, packed_dim]) lets a single indexing operation select blocks across every layer instead of looping per layer.
- Recompute tier: only fires on cache miss (prefix never seen before).
The key insight: DDR5-to-VRAM warming costs the same as a memcpy because both tiers store identical TQ4-format bytes. There is no encode/decode step. A block that was in VRAM, spilled to DDR5, and warmed back is bit-identical.
So now we have a 4.4-million-token memory hierarchy where the hot tier holds half a million tokens and cold-to-hot promotion is essentially free. What does the software above the GPU need to change?
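The capacity figures can be sanity-checked with a few lines of arithmetic. The sketch below assumes binary units (1 KB = 1024 bytes, 1 GB = 1024^3 bytes) and that the whole DDR5 pool is given to the cache; neither is stated explicitly above.

```python
KIB = 1024
GIB = 1024 ** 3

# Per-token KV cost, as quoted in the post.
BYTES_PER_TOKEN_TQ4 = 34 * KIB
BYTES_PER_TOKEN_FP16 = 128 * KIB


def tokens_that_fit(capacity_bytes: int, bytes_per_token: int) -> int:
    """Whole tokens that fit in a tier of the given size."""
    return capacity_bytes // bytes_per_token


# DDR5 tier: 128 GB at 34 KB per token.
ddr5_tokens = tokens_that_fit(128 * GIB, BYTES_PER_TOKEN_TQ4)

# VRAM tier: how many GiB the quoted 548,000 tokens occupy.
vram_gib_used = 548_000 * BYTES_PER_TOKEN_TQ4 / GIB

# Compression relative to FP16.
ratio_vs_fp16 = BYTES_PER_TOKEN_FP16 / BYTES_PER_TOKEN_TQ4

print(f"DDR5 capacity: {ddr5_tokens:,} tokens")
print(f"VRAM used by 548,000 tokens: {vram_gib_used:.2f} GiB")
print(f"TQ4 vs FP16: {ratio_vs_fp16:.2f}x smaller")
```

Under these assumptions the DDR5 figure comes out just under 3.95 million tokens, matching the 3.9 million quoted. The 548,000 VRAM tokens occupy a little under 18 GiB of the 20 GB allocation; the post does not itemize what the remainder is used for.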
Six Components
We built six modules that bridge the application-level context pipeline to the physical cache tiers. Each one addresses a specific assumption that broke when memory stopped being scarce.
1. BlockMetadataTable: Giving Blocks a Name
The GPU cache is addressed by physical block index. Block 7,392 means nothing -- it is a position in a tensor. But the software that filled that block knows exactly what it contains: tokens 4,200 through 4,215 from the capabilities layer, importance score 0.85, tenant "platform."
BlockMetadataTable maps physical block indices to semantic metadata:
from dataclasses import dataclass

@dataclass
class BlockMeta:
    source_layer: str         # "axioms", "neurons", "memories", etc.
    importance: float         # 0-1 from extractive scoring
    created_at: float         # timestamp when the block was filled
    last_attended: float      # updated from attention scores
    eviction_priority: int    # lower = evict first, 999 = pinned
    token_range: tuple        # (start, end) positions
    tenant_slug: str
Eviction priority is computed from a source hierarchy:
axioms: 100 (never evict -- existential)
identity: 90
rules: 85
capabilities: 80
soul: 75
partner: 55
neurons: 40
memories: 35
affect: 20
user_turn: 15 (always evict first -- ephemeral)
This means when VRAM pressure triggers eviction, the system does not use LRU (which treats axioms and old chat turns equally). It evicts by semantic importance. The cheapest blocks go first. The most important blocks stay forever.
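A minimal sketch of priority-ordered eviction, assuming an in-memory dict keyed by physical block index. The class name, the default priority of 0 for unknown sources, and the importance tie-break are assumptions; only the priority values and the 999 pin sentinel come from the text above.

```python
from dataclasses import dataclass

PINNED = 999  # sentinel from the post: 999 = pinned

# Source hierarchy from the post (lower = evict first).
SOURCE_PRIORITY = {
    "axioms": 100, "identity": 90, "rules": 85, "capabilities": 80,
    "soul": 75, "partner": 55, "neurons": 40, "memories": 35,
    "affect": 20, "user_turn": 15,
}


@dataclass
class BlockMetaLite:
    source_layer: str
    importance: float
    eviction_priority: int


class BlockMetadataTableSketch:
    def __init__(self) -> None:
        self._blocks: dict[int, BlockMetaLite] = {}

    def register(self, block_idx: int, source_layer: str, importance: float) -> None:
        # Assumption: unknown sources get priority 0 and are evicted first.
        priority = SOURCE_PRIORITY.get(source_layer, 0)
        self._blocks[block_idx] = BlockMetaLite(source_layer, importance, priority)

    def pin(self, block_idx: int) -> None:
        self._blocks[block_idx].eviction_priority = PINNED

    def eviction_candidates(self, n: int) -> list[int]:
        """Up to n unpinned blocks, lowest priority first.

        Assumption: ties are broken by lower importance, then block index.
        """
        evictable = sorted(
            (m.eviction_priority, m.importance, idx)
            for idx, m in self._blocks.items()
            if m.eviction_priority < PINNED
        )
        return [idx for _, _, idx in evictable[:n]]


table = BlockMetadataTableSketch()
table.register(7392, "capabilities", 0.85)
table.register(10, "user_turn", 0.30)
table.register(11, "user_turn", 0.10)
table.register(12, "axioms", 1.00)
table.pin(12)
victims = table.eviction_candidates(2)
```

Both user-turn blocks are chosen before the capabilities block, and the pinned axioms block is never a candidate no matter how many victims are requested.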
2. PrefixPinManager: The Blocks That Never Leave
Layers 0-3 of the system prompt (axioms, identity, rules, capabilities) are stable across turns. They change only when the agent's configuration changes, which is rare. Under the old architecture, these ~2,600 tokens were re-tokenized and re-computed on every single LLM call.
PrefixPinManager pins these blocks in VRAM permanently:
PINNABLE_SOURCES = frozenset({
    "axioms", "identity", "rules", "capabilities", "soul"
})
On the first call, the cache-aware pipeline identifies which blocks belong to stable layers and pins them. Pinning sets eviction_priority = 999 in the metadata table. From that point forward:
- The fused Triton kernel reads these blocks every decode step (they are just normal KV cache entries).
- They never get spilled to DDR5.
- They never get re-tokenized.
- They never get re-computed.
The savings compound. If you make 100 LLM calls in a session and the stable prefix is 2,600 tokens, the prefix is computed once instead of 100 times: 99 x 2,600 = 257,400 tokens of redundant KV computation eliminated.
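Pinning and the resulting savings can be sketched as below. The two plain dicts stand in for the metadata table, and both function names are hypothetical; only PINNABLE_SOURCES and the 999 sentinel come from the text.

```python
PINNABLE_SOURCES = frozenset({"axioms", "identity", "rules", "capabilities", "soul"})
PINNED = 999


def pin_stable_blocks(priorities: dict[int, int], sources: dict[int, str]) -> list[int]:
    """Set eviction priority to PINNED for every block from a pinnable source.

    Mutates `priorities` in place and returns the pinned block indices.
    """
    pinned = []
    for block_idx, source in sources.items():
        if source in PINNABLE_SOURCES:
            priorities[block_idx] = PINNED
            pinned.append(block_idx)
    return sorted(pinned)


def prefill_tokens_saved(calls: int, prefix_tokens: int) -> int:
    # The first call still has to compute the prefix once.
    return max(calls - 1, 0) * prefix_tokens


sources = {0: "axioms", 1: "identity", 2: "memories"}
priorities = {0: 100, 1: 90, 2: 35}
pinned_blocks = pin_stable_blocks(priorities, sources)
saved = prefill_tokens_saved(calls=100, prefix_tokens=2_600)
```

For 100 calls and a 2,600-token prefix this gives 257,400 tokens saved, since the very first call cannot reuse anything.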
3. CacheAwareContextPipeline: Build Once, Warm Forever
This is the overlay that wraps build_system_message() with cache intelligence.
On the first call, it works exactly like before: call build_system_message(), get the full system prompt, tokenize everything, compute all KV. But it also:
- SHA-256 hashes the stable prefix (everything through [/CAPABILITIES]).
- Registers all blocks with BlockMetadataTable.
- Pins the stable prefix via PrefixPinManager.
- Saves the cache state.
On subsequent calls, it hashes the stable prefix again and compares. If the hash matches (which it will, because axioms do not change between chat turns), the pipeline knows those blocks are already in VRAM. It marks them as "reusable" -- no recompute needed. Only the dynamic suffix (layers 4-9) needs fresh KV computation.
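The hash-and-compare step might look something like this sketch, which uses hashlib from the standard library. The [/CAPABILITIES] marker is from the text; the class, the function names, and the choice to treat a prompt without the marker as fully dynamic are assumptions.

```python
import hashlib
from typing import Optional

STABLE_END_MARKER = "[/CAPABILITIES]"  # marker named in the post


def split_prompt(prompt: str) -> tuple[str, str]:
    """Split into (stable prefix through the marker, dynamic suffix)."""
    end = prompt.find(STABLE_END_MARKER)
    if end == -1:
        return "", prompt  # assumption: no marker means nothing is stable
    cut = end + len(STABLE_END_MARKER)
    return prompt[:cut], prompt[cut:]


class PrefixReuseTracker:
    """Remembers the last stable-prefix digest and reports whether it repeats."""

    def __init__(self) -> None:
        self._last_digest: Optional[str] = None

    def can_reuse(self, prompt: str) -> bool:
        prefix, _ = split_prompt(prompt)
        if not prefix:
            self._last_digest = None
            return False
        digest = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        reusable = digest == self._last_digest
        self._last_digest = digest
        return reusable


stable = "[AXIOMS]be honest[/AXIOMS][CAPABILITIES]search[/CAPABILITIES]"
tracker = PrefixReuseTracker()
first = tracker.can_reuse(stable + "\ntime: 10:00")
second = tracker.can_reuse(stable + "\ntime: 10:01")  # only the suffix changed
third = tracker.can_reuse(stable.replace("search", "search, browse") + "\ntime: 10:02")
```

The first call never reuses, the second reuses because only the dynamic suffix changed, and the third misses because the capabilities text changed.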
But the real change is the token budgets. When the 3-tier cache is active, the pipeline returns expanded budgets:
| Layer | Old Budget | TQ Budget |
|---|---|---|
| Neurons | 1,600 | 8,000 |
| Memories | 1,100 | 6,000 |
| Partner | 2,000 | 4,000 |
| System State | 800 | 2,000 |
| Awareness | 400 | 1,200 |
The extractive compressor _synthesize_for_budget() still runs, but with 2x to 5x the budget depending on the layer, it keeps almost everything. The spillover system barely fires. The model sees richer, more complete context -- not because we changed the retrieval, but because we stopped throwing most of it away.
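As a rough illustration, the two budget columns can be held as plain dicts and switched on whether the cache is active. The dict keys and the layer_budgets() helper are hypothetical; the numbers are copied from the table above.

```python
OLD_BUDGETS = {
    "neurons": 1_600, "memories": 1_100, "partner": 2_000,
    "system_state": 800, "awareness": 400,
}
TQ_BUDGETS = {
    "neurons": 8_000, "memories": 6_000, "partner": 4_000,
    "system_state": 2_000, "awareness": 1_200,
}


def layer_budgets(tq_cache_active: bool) -> dict[str, int]:
    """Return a copy of the budgets to use for this call."""
    return dict(TQ_BUDGETS if tq_cache_active else OLD_BUDGETS)


# Per-layer expansion factor (TQ budget divided by old budget).
expansion = {name: TQ_BUDGETS[name] / OLD_BUDGETS[name] for name in OLD_BUDGETS}
total_old = sum(OLD_BUDGETS.values())
total_tq = sum(TQ_BUDGETS.values())
```

The expansion ranges from 2x (partner) to about 5.5x (memories), and the five dynamic layers together grow from 5,900 to 21,200 tokens.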
4. GraphBlockReserver: Pre-Staging Memory Neighborhoods
Currently, graph memory queries return a handful of nodes truncated to fit the 1,100-token memory budget. With 3.9 million tokens of DDR5, the query pattern changes fundamentally.
GraphBlockReserver sits between MemoryGraph.hybrid_query() and the TQGPUCache:
Query: "What did we decide about the auth rewrite?"
-> MemoryGraph returns 47 relevant nodes (episodic + semantic)
-> GraphBlockReserver estimates: ~12,000 tokens
-> Allocates 750 cold-tier block indices in DDR5
-> Stores ReservationResult with block metadata
-> On demand: warm_reservation() promotes blocks to VRAM
The reservation is a future: the blocks are allocated in DDR5 (cold tier) immediately, but they only move to VRAM when the model's attention actually needs them. This is speculative pre-staging -- we bet that the nodes returned by the graph query will be relevant, allocate them cheaply in DDR5, and promote on demand.
If the reservation is never warmed (the conversation went a different direction), we release the blocks. The cost of a wrong bet is a few cold-tier block indices. The benefit of a right bet is that 47 full memory nodes are available to the model instead of 5 truncated ones.
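The reserve, warm, and release lifecycle can be sketched with a simple free list. Everything here is hypothetical except the numbers: the 16-token block size is an assumption inferred from "~12,000 tokens" mapping to 750 blocks, and warm() only flips a flag where a real implementation would copy DDR5 to VRAM.

```python
import math
from dataclasses import dataclass
from typing import Optional

BLOCK_SIZE_TOKENS = 16  # assumption: implied by 12,000 tokens -> 750 blocks


@dataclass
class ReservationSketch:
    reservation_id: int
    block_indices: list[int]
    warmed: bool = False


class ColdBlockReserverSketch:
    def __init__(self, num_cold_blocks: int) -> None:
        self._free: list[int] = list(range(num_cold_blocks))
        self._live: dict[int, ReservationSketch] = {}
        self._next_id = 0

    @property
    def free_blocks(self) -> int:
        return len(self._free)

    def reserve(self, estimated_tokens: int) -> Optional[ReservationSketch]:
        needed = math.ceil(estimated_tokens / BLOCK_SIZE_TOKENS)
        if needed > len(self._free):
            return None  # not enough cold blocks; caller falls back
        blocks, self._free = self._free[:needed], self._free[needed:]
        res = ReservationSketch(self._next_id, blocks)
        self._live[res.reservation_id] = res
        self._next_id += 1
        return res

    def warm(self, reservation_id: int) -> list[int]:
        # Placeholder: a real implementation would promote DDR5 -> VRAM here.
        res = self._live[reservation_id]
        res.warmed = True
        return res.block_indices

    def release(self, reservation_id: int) -> None:
        # A wrong bet only costs returning the indices to the free list.
        res = self._live.pop(reservation_id)
        self._free.extend(res.block_indices)


reserver = ColdBlockReserverSketch(num_cold_blocks=1_000)
reservation = reserver.reserve(estimated_tokens=12_000)
```

Releasing an unwarmed reservation just returns its indices to the free list, which is why a wrong speculative bet is cheap.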
5. TierCacheBridge: Closing the Loop
The fifth piece connects the application-level ContextTierManager OODA loop to the physical cache.
The ContextTierManager runs a 60-second OODA cycle:
- Observe: scan spillover queue, tier occupancy, access patterns
- Orient: score items by frequency, recency, relevance
- Decide: build promotion/demotion/archive action list
- Act: execute tier moves
Previously, "act" meant moving content between in-process caches and vector stores. Now, TierCacheBridge hooks into the act phase and adds GPU cache operations:
| OODA Action | TierCacheBridge Response |
|---|---|
| Promote to Tier 0/1 | warm_blocks() DDR5 -> VRAM |
| Demote from Tier 0/1 | spill_blocks() VRAM -> DDR5 |
| Evict from Tier 1 | remove_blocks() free metadata |
The bridge also suggests threshold adjustments based on DDR5 capacity. If DDR5 is less than 50% full, it suggests lowering promotion thresholds (promote faster, warm faster -- memory is cheap). If DDR5 is over 80%, it raises demotion thresholds (keep blocks in VRAM longer, avoid thrashing).
This creates a closed feedback loop: the OODA cycle's decisions about which memories matter flow down to the GPU cache, and the GPU cache's capacity constraints flow back up to the tier manager.
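A sketch of the two halves of that loop: translating OODA actions into cache operations, and turning DDR5 occupancy into a threshold suggestion. The operation names and the 50% / 80% cut-offs come from the text; the action labels, function names, and return strings are assumptions.

```python
from typing import Optional

# OODA action -> cache operation, following the table in the post.
ACTION_TO_CACHE_OP = {
    "promote": "warm_blocks",    # DDR5 -> VRAM
    "demote": "spill_blocks",    # VRAM -> DDR5
    "evict": "remove_blocks",    # free metadata
}


def plan_cache_ops(actions: list[tuple[str, list[int]]]) -> list[tuple[str, list[int]]]:
    """Translate (action, block_indices) pairs; actions with no GPU counterpart are dropped."""
    return [
        (ACTION_TO_CACHE_OP[action], blocks)
        for action, blocks in actions
        if action in ACTION_TO_CACHE_OP
    ]


def suggest_threshold_adjustment(ddr5_used_fraction: float) -> Optional[str]:
    """Return a suggestion label, or None when occupancy is in the neutral band."""
    if not 0.0 <= ddr5_used_fraction <= 1.0:
        raise ValueError("ddr5_used_fraction must be between 0 and 1")
    if ddr5_used_fraction < 0.50:
        return "lower_promotion_thresholds"   # memory is cheap: promote faster
    if ddr5_used_fraction > 0.80:
        return "raise_demotion_thresholds"    # keep blocks in VRAM, avoid thrashing
    return None


ops = plan_cache_ops([("promote", [1, 2]), ("archive", [3]), ("evict", [9])])
```

The "archive" action in the example has no entry in the table, so the sketch drops it; how the real bridge treats such actions is not described.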
6. StrataCacheShadow: Surviving Restarts
Everything above works beautifully within a single vLLM session. But sessions end. GPUs reset. Docker containers restart. When that happens, the DDR5 cold tier is gone -- and with it, every pre-staged graph neighborhood, every spilled prefix block, every warmed conversation history.
StrataCacheShadow solves this by periodically writing the cold-tier state to Strata's warm storage tier (aither://warm/kvcache/). The shadow is a faithful copy:
aither://warm/kvcache/{session_id}/manifest.json — block metadata + pin state
aither://warm/kvcache/{session_id}/k_packed.bin — raw TQ4 key data
aither://warm/kvcache/{session_id}/k_norms.bin — float32 key norms
aither://warm/kvcache/{session_id}/v_packed.bin — raw TQ4 value data
aither://warm/kvcache/{session_id}/v_norms.bin — float32 value norms
aither://warm/kvcache/index.json — session discovery index
The manifest carries everything the system needs to reconstruct the cache state: block indices, source layers, importance scores, token ranges, tenant slugs, and pin state. The binary files are the raw TQ4 tensors -- same bytes that live in DDR5.
Recovery is straightforward:
- On vLLM startup, discover_sessions() reads the index from Strata.
- For each session, recover_session() loads manifest + binary data.
- Blocks are written directly into the TQGPUCache cold tier (DDR5).
- BlockMetadataTable is repopulated from the manifest.
- PrefixPinManager re-pins stable prefix blocks.
- Next warm_blocks() call promotes them to VRAM as usual.
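The shadow-and-recover round trip can be sketched with a local directory standing in for aither://warm/kvcache/. The file names follow the layout listed above; the function signatures, the index.json schema, and the manifest fields are assumptions, and raw bytes stand in for the TQ4 tensors.

```python
import json
import tempfile
from pathlib import Path


def write_shadow_sketch(root: Path, session_id: str, manifest: dict,
                        tensors: dict[str, bytes]) -> None:
    session_dir = root / session_id
    session_dir.mkdir(parents=True, exist_ok=True)
    for name, payload in tensors.items():
        (session_dir / f"{name}.bin").write_bytes(payload)
    # Write the manifest after the data so a manifest implies data is present.
    (session_dir / "manifest.json").write_text(json.dumps(manifest))
    # Assumed index schema: {"sessions": [session_id, ...]}.
    index_path = root / "index.json"
    index = json.loads(index_path.read_text()) if index_path.exists() else {"sessions": []}
    if session_id not in index["sessions"]:
        index["sessions"].append(session_id)
    index_path.write_text(json.dumps(index))


def discover_sessions_sketch(root: Path) -> list[str]:
    index_path = root / "index.json"
    if not index_path.exists():
        return []
    return json.loads(index_path.read_text())["sessions"]


def recover_session_sketch(root: Path, session_id: str) -> tuple[dict, dict[str, bytes]]:
    session_dir = root / session_id
    manifest = json.loads((session_dir / "manifest.json").read_text())
    tensors = {p.stem: p.read_bytes() for p in session_dir.glob("*.bin")}
    return manifest, tensors


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_shadow_sketch(
        root, "sess-1",
        manifest={"blocks": [{"index": 0, "source_layer": "axioms", "pinned": True}]},
        tensors={"k_packed": b"\x01\x02", "k_norms": b"\x00" * 4},
    )
    sessions = discover_sessions_sketch(root)
    manifest, tensors = recover_session_sketch(root, sessions[0])
```

After recovery, the manifest would drive repopulating the metadata table and re-pinning, and the byte payloads would be copied back into the cold tier unchanged.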
The cost of shadowing is modest. A typical session with 10,000 cold-tier tokens (625 blocks at 16 tokens per block) at 34 KB per token is roughly 340 MB of data. Strata's warm tier is backed by NVMe with multi-GB/s bandwidth. Writing takes under a second; reading takes about the same. For a 60-second shadow interval, that is less than 2% overhead.
The deeper value is that the KV cache is no longer ephemeral. A user's conversation context, agent memories, and prefix state survive across vLLM restarts, GPU resets, and even host reboots. The 3-tier cache becomes a 4-tier cache: VRAM -> DDR5 -> NVMe (Strata) -> Recompute.
The 5D Layout
All of this works because of a single design decision in the TQGPUCache: the 5D contiguous tensor layout.
_k_packed = torch.zeros(
    num_layers, max_blocks, block_size, num_kv_heads, packed_dim,
    dtype=torch.uint8, device="cuda")
Five dimensions: [layer, block, position, head, packed_dim].
Why this matters:
- Cross-layer bulk operations: _k_packed[:, block_indices] selects blocks across all 32 layers in one slice. Spilling 100 blocks is 4 tensor copies (K packed, K norms, V packed, V norms), regardless of layer count. Without the 5D layout, you would need 4 x 32 = 128 copies.
- Per-layer views are free: _k_packed[layer_idx] returns a contiguous 4D view. The fused Triton kernel takes this directly -- no copy, no reshape.
- DDR5 mirrors VRAM exactly: the cold tier uses the same 5D shape in pinned memory. Warming is literally tensor[:, indices].to(device, non_blocking=True). No format conversion. No re-encoding.
The 5D layout is what makes the bridge viable. Without it, warm_blocks() and spill_blocks() would be expensive per-layer Python loops, and the OODA cycle could not afford to trigger them on every promotion.
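The indexing behavior can be checked on a toy tensor. This sketch assumes PyTorch is installed and uses deliberately tiny dimensions; it falls back to CPU when no GPU is present, so it shows the shapes and the equivalence with a per-layer loop, not the transfer performance.

```python
import torch

num_layers, max_blocks, block_size, num_kv_heads, packed_dim = 4, 8, 16, 2, 32
device = "cuda" if torch.cuda.is_available() else "cpu"

# Cold tier: same 5D shape as the hot tier, held in host memory.
cold_k = torch.randint(
    0, 256,
    (num_layers, max_blocks, block_size, num_kv_heads, packed_dim),
    dtype=torch.uint8,
)
if device == "cuda":
    cold_k = cold_k.pin_memory()  # pinning only makes sense with a GPU present

block_indices = torch.tensor([1, 5, 6])

# One indexing op gathers the chosen blocks for every layer at once:
# result shape is [layers, 3, pos, head, packed_dim].
warmed = cold_k[:, block_indices].to(device, non_blocking=True)

# The per-layer loop this replaces (one gather per layer).
per_layer = torch.stack([cold_k[layer, block_indices] for layer in range(num_layers)])

# Indexing the leading dimension with an int returns a contiguous 4D view
# that shares storage with the parent tensor.
layer_view = cold_k[2]
```

Note that indexing with an index tensor gathers into a new tensor rather than returning a view, which is the desired behavior for spill and warm, while the integer-indexed per-layer access is a true zero-copy view.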
What Changes for the User
From the model's perspective, nothing changes about how attention works. The Triton kernel still does online softmax over packed uint8 blocks. Queries are still rotated, split into even/odd halves, and scored against codebook-gathered K vectors.
What changes is what the model sees. Instead of 6K tokens of aggressively compressed context with lossy extractive synthesis, the model gets up to 25K tokens of rich, minimally compressed context:
- Full neuron search results instead of top-3 truncated.
- Complete conversation history instead of last-4-turns.
- Entire graph memory neighborhoods instead of 5 surface nodes.
- Uncompressed partner knowledge with full examples.
And the stable prefix -- the 2,600 tokens that define who the agent is, what it can do, and how it should behave -- never burns inference compute. It is computed once and pinned.
The Deeper Point
There is a broader design pattern here. GPU memory systems are evolving toward the same tiered architecture that operating systems have used for decades: fast cache (SRAM / VRAM), main memory (DRAM / DDR5), and disk (SSD / recompute). The hardware gives you the tiers; the software has to decide what lives where.
The traditional approach in LLM serving treats KV cache as an opaque, flat buffer managed entirely by the inference engine. The application above it -- the context pipeline, the memory graph, the agent framework -- has no visibility into or control over what is cached, what is evicted, and what is recomputed.
What we built collapses that boundary. The context pipeline knows that axioms are pinned in VRAM blocks 0-12. The memory graph knows that 47 relevant nodes are pre-staged in DDR5 blocks 8,000-8,749. The tier manager knows that promoting a memory from WorkingMemory to the system prompt should also warm the corresponding KV blocks from DDR5. The GPU cache is no longer opaque -- it is a participant in the semantic memory hierarchy.
This is not novel in concept. CPU caches have had quality-of-service annotations for years. NUMA-aware allocators give applications control over memory placement. What is novel is applying these ideas to LLM inference, where the "memory" is the model's attention context and the "application" is an agent framework that knows which memories matter.
Code
All six components are in lib/gpu/turboquant/:
| File | What |
|---|---|
| block_metadata.py | BlockMetadataTable + PrefixPinManager |
| graph_block_reserver.py | GraphBlockReserver + ReservationResult |
| cache_aware_context.py | CacheAwareContextPipeline + CacheState |
| tier_cache_bridge.py | TierCacheBridge (OODA -> GPU cache) |
| strata_shadow.py | StrataCacheShadow (crash recovery via Strata) |
| vllm_custom_backend.py | TQGPUCache with 5D layout (existing) |
Entry points via lib.gpu.turboquant:
from lib.gpu.turboquant import (
    get_block_metadata_table,
    get_prefix_pin_manager,
    get_graph_block_reserver,
    get_cache_aware_pipeline,
    get_tier_cache_bridge,
    get_strata_cache_shadow,
)
The components are designed to degrade gracefully. If the TQ cache is not running (CPU inference, Ollama, cloud API), all methods return empty results and the pipeline falls through to the existing build_system_message() with the original token budgets. No code changes are required upstream. The integration is purely additive.
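One common way to get this kind of graceful degradation is a null-object fallback, sketched below. This is not necessarily how the actual accessors work: the class, the helper, and the choice of exceptions to catch are all hypothetical.

```python
from typing import Callable, Optional


class NullTierCacheBridge:
    """Stand-in used when no TQ cache is running; every call is a no-op."""

    available = False

    def warm_blocks(self, block_indices: list[int]) -> list[int]:
        return []  # nothing was warmed

    def spill_blocks(self, block_indices: list[int]) -> list[int]:
        return []  # nothing was spilled

    def remove_blocks(self, block_indices: list[int]) -> list[int]:
        return []  # nothing was removed


def get_bridge_or_null(factory: Callable[[], Optional[object]]) -> object:
    """Return the real bridge if the factory yields one, else the null bridge."""
    try:
        bridge = factory()
    except (ImportError, RuntimeError):
        # Assumption: a missing GPU backend surfaces as one of these errors.
        return NullTierCacheBridge()
    return bridge if bridge is not None else NullTierCacheBridge()


def failing_factory() -> object:
    # Simulates CPU inference or a cloud API, where no TQ cache exists.
    raise RuntimeError("TQ cache not running")


bridge = get_bridge_or_null(failing_factory)
warmed_now = bridge.warm_blocks([1, 2, 3])
```

Callers never branch on whether the cache exists; they get empty results and continue with the original budgets.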
What Is Next
Two directions:
- Attention-score-based eviction: the last_attended field in BlockMeta is populated but not yet used for eviction decisions. The plan is to instrument the Triton kernel to export per-block attention norms, then use them to identify blocks the model is not attending to. Low-attention blocks get spilled even if their semantic importance is high -- if the model is not looking at it, it does not matter how important we think it is.
- Semantic prefetch: GraphBlockReserver currently allocates cold blocks reactively (after the query). The next step is hooking into the speculative prefetch system -- when the user starts typing, NanoGPT predicts the likely intent, and the reserver pre-stages relevant graph neighborhoods in DDR5 before the query arrives. By the time the model needs those memories, they are already warm.
The GPU gave us the tiers. The math gave us the compression. The bridge gives us the semantics. Now we have a memory system.