The Full Stack: How Six Optimizations Turned One GPU into a Datacenter

April 13, 2026 · 22 min read · David Parkhurst

Aither forgot its own name on a Tuesday.

Not in a dramatic, HAL-9000 way. Nothing crashed. No alarms fired. A user asked a question during a multi-user session, and the response started with "As an AI assistant, I can help you with that." Not "I'm Aither." Not the personality that eight context layers of identity, axioms, and soul configuration had been carefully assembled to produce. Just... generic. A language model with no idea who it was supposed to be.

The system was running 87 Docker containers. Real-time voice pipeline. Agent orchestration across 29 personas. A context pipeline that assembled 10+ layers of identity, memories, capabilities, and rules into every prompt. And the model was saying "As an AI assistant" because the KV cache was 82% full and the eviction policy was LRU.

LRU — least recently used. The same algorithm your browser uses to decide which tabs to unload. It does not read importance scores. It does not know that an axiom block tagged importance=100 contains the model's core identity, or that the block it is about to evict in favor of a throwaway assistant response tagged importance=15 is the reason the model knows its own name. LRU sees timestamps. The axiom block was written at the start of the session and never touched again (because axioms do not change). The assistant response was generated thirty seconds ago. LRU evicts the axiom block. The model forgets who it is.
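A minimal sketch of that failure mode, using a toy three-block cache. The block names and the importance value for the user turn are illustrative assumptions, not taken from the Aither codebase; only the importance=100 and importance=15 tags come from the incident described above.

```python
from collections import OrderedDict


def lru_victim(cache: OrderedDict) -> str:
    """Return the key LRU would evict: the least recently touched entry."""
    return next(iter(cache))


def importance_victim(cache: OrderedDict) -> str:
    """Return the key an importance-aware policy would evict instead."""
    return min(cache, key=lambda k: cache[k]["importance"])


# Insertion order doubles as recency order: oldest entry first.
cache = OrderedDict()
cache["axioms"] = {"importance": 100}             # written at session start
cache["user_turn"] = {"importance": 60}           # hypothetical value
cache["assistant_response"] = {"importance": 15}  # generated seconds ago

# Touching a block moves it to the "recent" end. Axioms are never touched
# again, so they stay at the front of the eviction queue.
cache.move_to_end("user_turn")
```

Here `lru_victim(cache)` picks the axiom block, while `importance_victim(cache)` picks the throwaway response. Same cache, same pressure, opposite outcomes.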

I stared at the logs for twenty minutes before I understood what was happening. The GPU was fast. The compression was working. The cache was huge. Everything was technically correct. And the model had no identity.

This is the thesis of this entire post, and I want to state it early: component optimization is not system optimization. You can build six individually excellent optimizations and still have a system that forgets its own name. The compound effect — where layers amplify each other in ways nobody designed — is the actual product. The individual layers are just prerequisites.

The Thesis

Over three months, I built six optimizations. Each one solved a specific bottleneck. Together, they solved a different problem entirely.

Here is the before and after. One table. The punchline on page one.

Metric	Before	After
Addressable tokens	80K	4.7M
Models loaded	1	2
Single-stream decode	85 tok/s	350-510 tok/s
Aggregate throughput	200 tok/s	3,400+ tok/s
System prompt budget	6K tokens	25K+ tokens
Follow-up handling	Broken	3-turn carry-forward
Persona drift	At 80% VRAM	Eliminated
Quality (GSM8K)	84.5%	84.5% (lossless)

Every row in the "After" column requires multiple layers cooperating. No single optimization produces any of these numbers alone. The 4.7M tokens require compression and tiered memory. The 350 tok/s requires speculative decoding and rich context (which requires tiered memory and prefix pinning). The eliminated persona drift requires block metadata and tiered overflow and graph eviction — three layers collaborating to fix a bug none of them caused individually.

That is the story. Not six optimizations. One system.

The Six Layers in 60 Seconds

If you have read the individual posts, skip this section. If you have not, this is orientation — one paragraph per layer, enough to follow the rest of the post.

Layer 1: TurboQuant. Sub-byte KV cache quantization. Rotates key/value vectors into a quantization-friendly space, then compresses to 4-bit using Lloyd-Max codebooks. 3.8x compression. 68 bytes per token instead of 256. Published as aither-kvcache on PyPI. Full post →

Layer 2: 3-Tier Cache. Treats KV cache as a memory hierarchy: VRAM (548K tokens, <1ms) → DDR5 (3.9M tokens, ~5ms) → recompute (unlimited, ~50ms). Six bridging components connect the physical cache to the application. Full post →

Layer 3: Graph Eviction Advisor. Builds a graph over KV cache blocks with six edge types (prefix share, co-attend, semantic, temporal, spill link, memory ref). Evicts isolated subgraphs. Protects highly-connected blocks. Makes identity blocks structurally unevictable. Full post →

Layer 4: Dual-Model Partitioning. VRAM zone allocation that fits both Nemotron-8B-AWQ (orchestrator) and DeepSeek-R1-14B-AWQ (reasoning) on one RTX 5090. Only possible because TQ4 compression shrinks KV caches from 5.0 GiB to 1.32 GiB combined. Full post →

Layer 5: DFlash. Block diffusion speculative decoding. Drafts 16 tokens simultaneously using iterative denoising (not sequentially like EAGLE). 87 MB draft model, 80-92% acceptance rate, 4-6x single-stream speedup. Mathematically lossless. Full post →

Layer 6: Context Pipeline. Five bugs across three files, fixed in 92 lines. Follow-up questions were losing their intent, dropping to bare LLM with no tools or context. The entire GPU stack delivering wrong answers because the application layer classified a follow-up as a generic question. Full post →

Tracing a Request Through the Full Stack

Let me show you what happens when you type a five-word follow-up question into a system running all six layers.

The setup: a user asked "what are the best ways to watch the 2028 Olympics?" two minutes ago. The system routed to web search, effort 5, MCTS planning, and returned comprehensive results with sources, ticket information, and streaming options. Now the user types:

"How do I get tickets?"

Five words. No explicit mention of the Olympics. No URL. Just an implicit reference to the previous conversation. Here is the full journey through all six layers.

Step 1: The Context Pipeline Catches the Follow-Up

ChatEngine receives the message. Before Layer 6's fix, this is where everything went wrong — the intent classifier saw a short, generic question with no strong signals, scored it as general_knowledge at effort 2, and routed to the bare LLM. No web search. No tools. No context. The model hallucinated an answer from training data.

After the fix, three things happen. First, the _TOOL_INTENTS set now includes web_research and research, so the prior turn's intent propagates forward. Second, active_intent from the session state is forwarded to UnifiedChatBackend.think() — prior intent, prior tools, and remaining turns all flow through the context dictionary. Third, the safety net checks prior intent state and bumps effort to 5 when carry-forward is active.

The question is correctly classified as a follow-up to a web research conversation. Effort 5. Full pipeline. Tools enabled.
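The carry-forward logic can be sketched roughly as follows. This is a simplified illustration, not the ChatEngine code: `SessionState`, `resolve_intent`, and the exact conditions are assumptions. Only `_TOOL_INTENTS`, `active_intent`, the intent names, the effort levels, and the 3-turn window come from the post.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Intent names from the post; the real set presumably has more members.
_TOOL_INTENTS = {"web_research", "research"}
CARRY_FORWARD_TURNS = 3


@dataclass
class SessionState:
    active_intent: Optional[str] = None
    remaining_turns: int = 0


def resolve_intent(classified: str, effort: int,
                   session: SessionState) -> Tuple[str, int]:
    """Apply carry-forward: a generic follow-up inherits the prior tool intent."""
    carry = (
        session.active_intent in _TOOL_INTENTS
        and session.remaining_turns > 0
        and classified == "general_knowledge"
    )
    if carry:
        session.remaining_turns -= 1
        # Safety net: escalate effort while carry-forward is active.
        return session.active_intent, max(effort, 5)
    if classified in _TOOL_INTENTS:
        # A fresh tool intent opens a new carry-forward window.
        session.active_intent = classified
        session.remaining_turns = CARRY_FORWARD_TURNS
    return classified, effort
```

With this sketch, the Olympics query opens a three-turn window, and the ticket question that would have been scored `general_knowledge` at effort 2 comes back as `web_research` at effort 5.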

Step 2: Cache-Aware Context Assembly

CacheAwareContextPipeline runs. It detects a stable prefix — the system prompt, identity layers, axioms, rules, capabilities, and soul configuration. This prefix is ~2,600 tokens and has not changed since the session started. SHA-256 hash check confirms: identical to the cached version. No recomputation needed.

Because 2,600 tokens of prefix are pinned and free, the pipeline expands downstream budgets. Neuron search results: 1,600 → 8,000 tokens. Memory retrieval: 1,100 → 6,000 tokens. Partner knowledge: 2,000 → 4,000 tokens. The conversation history from two minutes ago — the full Olympics response with sources and ticket info — fits comfortably in the expanded budget instead of being truncated to the last four turns.
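A minimal sketch of the change-detection step, assuming the prefix is available as a single string. The function names and the all-or-nothing budget switch are hypothetical; the budget figures are the ones quoted above.

```python
import hashlib
from typing import Dict, Optional, Tuple

BASE_BUDGETS = {"neuron_search": 1_600, "memory": 1_100, "partner_knowledge": 2_000}
EXPANDED_BUDGETS = {"neuron_search": 8_000, "memory": 6_000, "partner_knowledge": 4_000}


def prefix_digest(prefix: str) -> str:
    """SHA-256 over the stable prefix text."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


def choose_budgets(prefix: str,
                   cached_digest: Optional[str]) -> Tuple[Dict[str, int], str]:
    """Return (budgets, digest). Expanded budgets apply only when the prefix
    hash matches the cached one, i.e. the pinned KV blocks are still valid."""
    digest = prefix_digest(prefix)
    if digest == cached_digest:
        return EXPANDED_BUDGETS, digest
    return BASE_BUDGETS, digest
```

The first call in a session sees no cached digest and gets the base budgets. Every later call with an unchanged prefix gets the expanded ones.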

Step 3: System Message Assembly

build_system_message() assembles the full prompt. 10+ layers, 25K+ tokens:

[AXIOMS] — core behavioral constraints (importance 100)
Identity — "I am Aither, an AI agent operating system..."
[RULES] — operational boundaries
[CAPABILITIES] — available tools and their descriptions
[CONTEXT] — current system state, active services
[MEMORIES] — relevant episodic memories from the graph
[AFFECT] — emotional/persona configuration
[WILL] — current goals and intentions
[KNOWLEDGE] — neuron search results (8,000 token budget, not truncated)
[WEB SEARCH] — prior search results and sources
[RECENT CONVERSATION] — full conversation history including the Olympics query

25,000+ tokens of context. Before the tiered cache, this would have been 6,000 tokens — aggressively compressed, top-3 truncated, last-4-turns only.

Step 4: LLM Gateway → vLLM

The assembled prompt hits LLMGateway, which routes to AitherLLMQueue (in-process, no HTTP hop), which dispatches to the vLLM orchestrator instance.

But before vLLM ever loaded the model, something happened at process startup. The tq_sitecustomize.py file — a Python import hook sitting in the PYTHONPATH — intercepted builtins.__import__ and installed four phases of hooks. By the time this request arrives, TurboQuant's compression kernels are already wired into the attention backend, the engine knows about compressed block sizes, and DFlash's speculative proposer is ready to draft.

Step 5: TurboQuant Compresses the KV Cache

The 25K-token prompt generates KV cache entries. At FP16, this would cost 6.4 MB. At TQ4, it costs 1.7 MB — 68 bytes per token instead of 256. The compression happens inside the attention kernel: random orthogonal rotation spreads the energy across dimensions, then Lloyd-Max codebooks quantize to 4-bit. The attention computation happens in the rotated domain, so dequantization occurs once per query, not once per key.
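The arithmetic behind those figures, taking the per-token byte counts at face value as stated in the post (the helper name is illustrative):

```python
FP16_BYTES_PER_TOKEN = 256
TQ4_BYTES_PER_TOKEN = 68


def kv_megabytes(tokens: int, bytes_per_token: int) -> float:
    """KV cache footprint in decimal megabytes."""
    return tokens * bytes_per_token / 1_000_000


prompt_tokens = 25_000
fp16_mb = kv_megabytes(prompt_tokens, FP16_BYTES_PER_TOKEN)  # 6.4 MB
tq4_mb = kv_megabytes(prompt_tokens, TQ4_BYTES_PER_TOKEN)    # 1.7 MB
ratio = FP16_BYTES_PER_TOKEN / TQ4_BYTES_PER_TOKEN           # about 3.76, quoted as 3.8x
```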

The 3.8x savings are not just about fitting more tokens. They are the margin that makes everything downstream possible.

Step 6: 3-Tier Cache Manages Blocks

The tiered cache receives the new blocks. The 2,600-token prefix is already pinned in VRAM — no allocation needed. New conversation blocks go to hot tier. The Olympics response from two minutes ago, which would have been evicted under LRU pressure, was promoted from DDR5 warm tier when the follow-up question triggered carry-forward. Promotion latency: ~5ms. The user does not notice.

TierCacheBridge runs its OODA loop: observe spillover pressure (currently 64% VRAM utilization — comfortable), orient by frequency/recency/relevance, decide no promotions or demotions needed. The system is in steady state.
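A toy version of the hot/warm mechanics described in this step. It is a sketch under heavy simplification: the class, its hit-count demotion rule, and the capacity of two blocks are assumptions for illustration, and the real TierCacheBridge weighs frequency, recency, and relevance rather than hit count alone.

```python
class TwoTierCache:
    """Toy hierarchy: `hot` stands in for VRAM, `warm` for DDR5."""

    def __init__(self, hot_capacity: int):
        self.hot_capacity = hot_capacity
        self.hot = {}       # block_id -> payload
        self.warm = {}      # block_id -> payload
        self.pinned = set()
        self.hits = {}      # block_id -> access count

    def put(self, block_id, payload, pinned=False):
        self.hot[block_id] = payload
        self.hits.setdefault(block_id, 0)
        if pinned:
            self.pinned.add(block_id)
        self._spill(protect=block_id)

    def get(self, block_id):
        if block_id in self.warm:
            # Promotion: pull the block back into the hot tier.
            self.hot[block_id] = self.warm.pop(block_id)
            self._spill(protect=block_id)
        if block_id not in self.hot:
            return None  # third tier: the caller would recompute
        self.hits[block_id] += 1
        return self.hot[block_id]

    def _spill(self, protect=None):
        # Demote the least-used unpinned blocks until the hot tier fits.
        while len(self.hot) > self.hot_capacity:
            candidates = [b for b in self.hot
                          if b not in self.pinned and b != protect]
            if not candidates:
                break
            victim = min(candidates, key=lambda b: self.hits[b])
            self.warm[victim] = self.hot.pop(victim)
```

With a capacity of two, putting a pinned prefix, an older response, and then a new turn demotes the older response to the warm tier. A later `get` on it promotes it back and demotes something else, while the pinned prefix never moves.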

Step 7: Graph Eviction Advisor Protects Identity

The KVCacheGraph runs its background traversal every 500ms. It examines the current block layout and computes eviction scores:

score = (age × 0.01) - (degree × 5.0) - (edge_weight_sum × 2.0)
      - (importance × 20.0) - (hit_count × 3.0)

The axiom blocks (importance=100, source_layer: "axioms") score deeply negative. They are connected to every generation via PREFIX_SHARE edges, have high degree centrality, and carry maximum importance weight. They are structurally unevictable — not by policy override, but by the math. Even if VRAM hits 95%, these blocks survive. The model knows who it is.
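The formula is easy to evaluate by hand. The sketch below assumes that a higher score means "evict sooner" and uses the 0 to 100 importance scale from the prose. The example block statistics are invented for illustration.

```python
def eviction_score(age_s: float, degree: int, edge_weight_sum: float,
                   importance: float, hit_count: int) -> float:
    """Eviction score as given above; higher is assumed to mean 'evict sooner'."""
    return (age_s * 0.01
            - degree * 5.0
            - edge_weight_sum * 2.0
            - importance * 20.0
            - hit_count * 3.0)


# Hypothetical axiom block: an hour old, but heavily connected and reused.
axiom = eviction_score(age_s=3600, degree=40, edge_weight_sum=40.0,
                       importance=100, hit_count=500)

# Hypothetical throwaway response: thirty seconds old, barely connected.
throwaway = eviction_score(age_s=30, degree=1, edge_weight_sum=0.5,
                           importance=15, hit_count=1)
```

Age contributes only 0.01 per second, so an hour of idleness adds 36 points. The importance term alone subtracts 2,000 for an axiom block, which is why age can never push it to the top of the eviction list in practice.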

The graph also runs predictive prefetch: examining co-attention edges from the current conversation, it identifies blocks from the previous Olympics response that are likely to be needed and warms them from DDR5. By the time the model's attention reaches those positions, the blocks are already in VRAM.

Step 8: Dual-Model Routing

This request goes to the orchestrator instance (Nemotron-8B-AWQ). But if the user had asked a math problem or a multi-step reasoning question, EffortScaler would have routed to the reasoning instance (DeepSeek-R1-14B-AWQ) on the same GPU. Both models are loaded simultaneously, sharing the 32 GB VRAM budget. TQ4 compression is what makes this fit — without it, the combined KV caches would require 5.0 GiB instead of 1.32 GiB, and the second model simply would not load.

Step 9: DFlash Drafts 16 Tokens at Once

The decode phase begins. Instead of generating one token per forward pass (85 tok/s baseline), DFlash takes over. The 87 MB draft model — a 4-layer, 256-hidden-dim transformer with bidirectional self-attention — takes 16 masked positions and denoises them in 4 steps using a cosine schedule: unmask [6, 5, 3, 2] positions per step, highest-confidence first.
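The confidence-ordered unmasking loop can be sketched like this. The `score_fn` interface and the toy scorer are assumptions standing in for the draft model. The schedule is taken as given from the post rather than derived.

```python
UNMASK_SCHEDULE = [6, 5, 3, 2]  # positions revealed per step, as quoted above


def denoise_block(score_fn, block_size=16, schedule=UNMASK_SCHEDULE):
    """Fill a block of masked positions, highest-confidence positions first.

    score_fn(tokens) must return one (token, confidence) pair per position;
    masked positions are represented as None in `tokens`.
    """
    if sum(schedule) != block_size:
        raise ValueError("schedule must reveal every position exactly once")
    tokens = [None] * block_size
    for n_reveal in schedule:
        proposals = score_fn(tokens)  # one draft-model forward pass
        masked = [i for i, t in enumerate(tokens) if t is None]
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:n_reveal]:
            tokens[i] = proposals[i][0]
    return tokens


calls = []


def toy_score_fn(tokens):
    """Stand-in for the draft model: proposes token id == position index."""
    calls.append(1)
    return [(i, 1.0 / (1 + i)) for i in range(len(tokens))]


draft = denoise_block(toy_score_fn)
```

Four calls to the scorer fill all sixteen positions, which is the property that lets the draft run in a fixed number of passes regardless of block content.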

Four forward passes through the tiny draft model produce 16 candidate tokens. The target model verifies all 16 in a single forward pass using rejection sampling: P(accept) = min(1, p_target(x) / q_draft(x)). Mathematically lossless — the output distribution is identical to the target model's regardless of draft quality.
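A sketch of the verification rule, written as standard speculative sampling over explicit probability tables. This is an illustration of the acceptance test quoted above, not DFlash's implementation: the dictionary-based distributions are an assumption, and the extra token a verifier normally samples after a fully accepted block is omitted for brevity.

```python
import random


def verify_draft(draft_tokens, q_draft, p_target, rng=random):
    """Accept draft tokens left to right with P = min(1, p_target/q_draft).

    q_draft[i] and p_target[i] map token -> probability at position i.
    On the first rejection, one replacement token is drawn from the
    normalized residual max(0, p - q) and verification stops.
    """
    out = []
    for i, x in enumerate(draft_tokens):
        p, q = p_target[i], q_draft[i]
        accept_prob = min(1.0, p.get(x, 0.0) / q[x])
        if rng.random() < accept_prob:
            out.append(x)
            continue
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        tokens, weights = zip(*residual.items())
        out.append(rng.choices(tokens, weights=weights)[0])
        break
    return out
```

The residual resampling step is what keeps the output distribution equal to the target model's: a bad draft costs speed, never correctness.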

Here is the critical interaction: the draft model's acceptance rate depends on how predictable the target model's output is. And predictability depends on context quality. With the full 25K-token context — complete conversation history, expanded neuron search, untruncated knowledge — the target model's output distribution is more structured, more grounded, less entropic. The draft model's acceptance rate climbs to 87% (compared to ~70% with the old sparse 6K context). That acceptance rate translates directly to speed: 350+ tok/s instead of ~250 tok/s.

Step 10: Response Arrives

The response arrives at 350+ tok/s. It references the specific Olympics information from the prior exchange. It uses the correct tone and personality. It suggests relevant tools (web search for current ticket availability). The user sees a coherent, contextual, fast response — unaware that 10 subsystems coordinated across GPU memory, system RAM, import hooks, graph traversals, and a speculative decoding model to produce it.

Total time from keystroke to first token: ~45ms. Total time for a 200-token response: ~570ms. On one GPU.

Where the Layers Multiply

Dependencies are boring. The interesting part is where the layers amplify each other in ways nobody designed.

DFlash Gets Faster When Context Gets Richer

This one surprised me. DFlash's acceptance rate is a function of draft-target distribution alignment. When the target model receives sparse context (the old 6K system prompt), its output distribution is high-entropy — many plausible next tokens, spread across a wide vocabulary. The draft model, working from the same sparse context, cannot predict well. Acceptance rate: ~70%. Effective speedup: 3-4x.

When the target model receives full context (25K+ tokens from the expanded pipeline), its output distribution narrows. The model is more certain. Fewer plausible continuations. The draft model's predictions align better. Acceptance rate: 85-92%. Effective speedup: 5-6x.

This is a free 1.5x multiplier that emerges from the interaction between Layer 2 (tiered cache enabling expanded budgets) and Layer 5 (speculative decoding). Neither layer was designed for this. The tiered cache was built to hold more tokens. DFlash was built to decode faster. But richer context makes speculative decoding more effective, because a better-informed model is a more predictable model.
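To see why acceptance rate matters so much, it helps to look at the standard idealized model from the speculative decoding literature, where each draft token is accepted independently with probability alpha. This is a textbook approximation, not a measurement from this system: it ignores the cost of the draft passes, so it gives an upper bound on speedup rather than the numbers reported here.

```python
def expected_tokens_per_step(alpha: float, block_size: int) -> float:
    """Expected tokens emitted per target forward pass.

    Assumes i.i.d. acceptance with probability `alpha` per draft token and
    counts the one token the target contributes at the end of each step.
    """
    if alpha >= 1.0:
        return float(block_size + 1)
    return (1 - alpha ** (block_size + 1)) / (1 - alpha)
```

The function is steeply convex near 1.0, so a few points of acceptance rate are worth much more at 87% than at 70%. That is the shape behind the multiplier described above.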

The code that bridges them is DFlashKVExtractor.extract_from_tq_cache() — the method where DFlash reads TurboQuant-compressed cache entries for cross-attention:

def extract_from_tq_cache(
    self,
    tq_cache: Any,
    block_table: torch.Tensor,
    context_len: int,
    layer_idx: int = -1,
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Extract K/V from TurboQuant compressed cache (decompress to FP16)."""
    if layer_idx < 0:
        layer_idx = tq_cache.num_layers + layer_idx

    k_cache, v_cache = tq_cache.decompress_layer(layer_idx)

    return self.extract_from_paged_cache(
        k_cache, v_cache, block_table, context_len, layer_idx,
    )

A short method. DFlash calls tq_cache.decompress_layer() to get FP16 key/value tensors from 4-bit compressed storage, then runs its standard paged extraction. Two independent optimization systems, sub-byte quantization and speculative decoding, cooperate through a single method call. The approximation error from 4-bit quantization (MSE ~0.0095 per coordinate) actually helps slightly, preventing the draft model from overfitting to specific key/value representations.

Prefix Pinning Creates a Budget Cascade

The stable prefix is ~2,600 tokens. Pin it once and it never recomputes. Over 100 LLM calls, that eliminates 260,000 tokens of redundant KV computation. But the savings do not stop at computation.

The 2,600 tokens freed from the per-request budget cascade through the entire context pipeline. CacheAwareContextPipeline detects the pinned prefix via SHA-256 change detection, confirms no recomputation is needed, and reallocates the budget:

Neuron search results: 1,600 → 8,000 tokens (5x)
Memory retrieval: 1,100 → 6,000 tokens (5.5x)
Partner knowledge: 2,000 → 4,000 tokens (2x)
Conversation history: last 4 turns → full history

Richer context → more predictable target model → higher DFlash acceptance → faster decode. The prefix pinning (Layer 2) expands budgets (Layer 6), which improves speculative decoding (Layer 5), which was only possible because compression (Layer 1) made room for the pinned prefix in the first place. Circular reinforcement across four layers.

Persona Drift Required Three Layers to Fix

This is the bug from the opening paragraph, and it is the best example of why system-level thinking matters.

The persona drift happened because LRU eviction does not read importance scores. The axiom blocks — tagged importance=100, source_layer: "axioms" in BlockMetadataTable — were the oldest blocks in the cache (written at session start, never modified). Under VRAM pressure, LRU evicted them first. The model lost its identity.

Fixing this required three layers:

The metadata layer (BlockMetadataTable) tags every physical block with its semantic origin. Without this metadata, no eviction policy can distinguish axioms from throwaway responses. The KVBlockNode dataclass carries source_layer, importance, and token_range:

from dataclasses import dataclass, field
import time
from typing import Tuple

@dataclass
class KVBlockNode:
    block_idx: int
    source_layer: str        # axioms, identity, memories, user_turn, etc.
    importance: float        # 0-1 from context budget scoring
    token_range: Tuple[int, int]
    tenant_slug: str
    created_at: float = field(default_factory=time.time)
    last_attended: float = field(default_factory=time.time)
    hit_count: int = 0
    is_spilled: bool = False

Layer 2 (Tiered Cache) provides DDR5 overflow so that protecting axiom blocks does not create memory pressure. If you make axioms non-evictable but have nowhere for displaced blocks to go, VRAM fills up and allocation fails. The warm tier gives displaced blocks somewhere to live — a safety valve that makes the protection policy viable.

Layer 3 (Graph Eviction Advisor) makes axioms structurally unevictable. The eviction score formula weights importance at 20x, and axiom blocks also accumulate PREFIX_SHARE edges (high degree) from being part of every request's prefix. Even without the explicit non-evictable flag, the math makes it nearly impossible to evict them. With the flag, it is literally impossible.

Three layers. No single layer fixes the bug. BlockMetadataTable without the graph advisor is just unused metadata. The graph advisor without the tiered cache creates memory pressure it cannot relieve. The tiered cache without metadata is a dumb spill buffer that does not know what to protect. The fix is the interaction.

Two War Stories

I want to share two moments where everything was technically correct and nothing worked.

CUDA Graph Gibberish

TurboQuant v0.5 was mathematically correct. The rotated-domain attention kernel produced the right output on every unit test. Perplexity benchmarks matched the paper. I deployed it to vLLM and got gibberish.

Not high-perplexity text. Not slightly wrong answers. Gibberish. Random tokens. Corrupted output.

The problem was CUDA graphs. vLLM captures CUDA graphs — frozen snapshots of GPU execution — to eliminate kernel launch overhead. But CUDA graph capture requires that every kernel call produces the same sequence of GPU operations regardless of input. My TQ4 kernel used data-dependent control flow: different codebook lookups for different quantization bins. Every token broke the graph.

144 CUDA graph breaks per token. Each break forces vLLM to fall back from the cached graph to eager execution, re-record the graph, and resume. The overhead was catastrophic. Not just slow — the re-recording process corrupted state because the graph capture happened mid-computation with stale intermediate tensors.

The fix took three versions:

Version	What Changed	Graph Breaks/Token	Throughput
v0.5	Correct but naive	144	26 tok/s
v0.7	Eliminated data-dependent branches	36	87.9 tok/s
v0.8	Removed CPU-GPU synchronization stalls	0 (but 72 sync stalls)	224 tok/s
v1.0	Custom ops registered as CUDA-safe	0	382 tok/s
v1.3	Split-k parallelism across 170 SMs	0	589 tok/s

From 26 to 589 tok/s — a 22x improvement — without changing the quantization algorithm at all. The rotation matrices were identical. The codebooks were identical. The compression ratio was identical. Every single gain came from removing implementation overhead: graph breaks, synchronization stalls, kernel launch latency, and finally SM utilization.

A correct algorithm is necessary but nowhere near sufficient. I spent two weeks on the math and six weeks on making the math play nice with the GPU execution model.

The 92-Line Fix

The second war story is the opposite. No CUDA. No kernels. No papers. Five bugs across three files, and the entire GPU stack delivering wrong answers.

A user asks about the Olympics. Web search fires, effort 5, full context pipeline, great response. They follow up: "how do I get tickets?" The system routes to effort 2 — bare LLM, no web search, no tools, no context from the previous turn. The model hallucinates an answer. The user sees a fast, confident, completely wrong response.

I spent a day looking at GPU metrics, cache hit rates, and attention patterns before I realized the problem was in the intent classifier. A broad regex pattern (^how\b) was scoring 0.80 confidence, beating the specific WEB_RESEARCH patterns at 0.65. The word "tickets" had no signal in the intent vocabulary. The carry-forward mechanism did not include web_research in its set of tool intents. The session's active_intent was never forwarded to the chat backend.

Five bugs. Three files. 92 lines of fix:

Specificity boost: specific patterns get +0.05 over broad patterns
Added web_research to the neuron vocabulary
Added web_research and research to _TOOL_INTENTS
Forwarded active_intent through the context dictionary
Safety net checks prior intent state for effort escalation
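The first fix is the easiest to picture in code. The sketch below reproduces the scoring collision and applies one possible reading of the specificity boost: when a specific pattern and a broad pattern both match, the specific one is lifted to 0.05 above the best broad score. That reading, the pattern table, and the `classify` function are all assumptions for illustration. Only the `^how\b` pattern, the 0.80 and 0.65 confidences, and the +0.05 figure come from the post.

```python
import re

# Hypothetical pattern table: (regex, intent, base confidence, is_specific)
PATTERNS = [
    (re.compile(r"^how\b", re.I), "general_knowledge", 0.80, False),
    (re.compile(r"\b(tickets?|schedule|streaming)\b", re.I), "web_research", 0.65, True),
]
SPECIFICITY_BOOST = 0.05


def classify(text: str, boost: bool = True):
    """Return (intent, confidence) for the best-scoring matching pattern."""
    matches = [(intent, conf, specific)
               for rx, intent, conf, specific in PATTERNS if rx.search(text)]
    if not matches:
        return None, 0.0
    if boost:
        # Assumed rule: a specific match outranks any broad match by the boost.
        broad_best = max((c for _, c, s in matches if not s), default=0.0)
        matches = [(i, max(c, broad_best + SPECIFICITY_BOOST) if s else c, s)
                   for i, c, s in matches]
    intent, conf, _ = max(matches, key=lambda m: m[1])
    return intent, conf
```

Without the boost, "How do I get tickets?" lands on `general_knowledge` because the broad pattern wins on raw confidence. With it, the same input resolves to `web_research`.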

No compression ratios. No kernel optimizations. No VRAM accounting. Just application logic that drops the ball on an implicit reference.

The lesson: infrastructure does not matter if the application drops the ball. You can have 4.7 million tokens of addressable cache, 350 tok/s decode speed, graph-aware eviction, and two models running simultaneously — and the user will still say "it forgot what I just told it" if the intent classifier misroutes a follow-up question. The GPU stack is the foundation. The context pipeline is the product.

The sitecustomize.py File

There is one file that tells the story of this entire stack better than any benchmark table.

scripts/tq_sitecustomize.py is 148 lines. It is a Python import hook — it intercepts builtins.__import__ at vLLM process startup and installs four phases of modifications as specific vLLM modules load. Two completely independent optimization systems (TurboQuant and DFlash) coexisting in one process through coordinated import interception.

The file starts by checking environment variables:

_TQ_BITS = int(os.environ.get("AITHER_TQ_BITS", "0"))
_TQ_MODE = os.environ.get("AITHER_TQ_MODE", "")
_TQ_PRIMARY = _TQ_MODE.endswith("-primary") if _TQ_MODE else (
    os.environ.get("AITHER_TQ_PRIMARY", "0") == "1"
)

If AITHER_TQ_BITS is 2, 3, or 4, the hook activates. It replaces builtins.__import__ with _tq_import_hook, which watches for four specific module names and acts when they appear:

Phase 1 fires when vllm.v1.attention.backends.triton_attn loads. In PRIMARY mode, it monkey-patches TritonAttentionImpl.forward() to route through TQ's rotated-domain attention kernel. This is the compression — every attention computation now operates on 4-bit quantized KV entries with on-the-fly rotation:

if not _backend_registered and name == "vllm.v1.attention.backends.triton_attn":
    _backend_registered = True
    try:
        if _TQ_PRIMARY:
            from lib.gpu.turboquant.vllm_hooks import apply_tq_hooks
            ok = apply_tq_hooks()
            pid = os.getpid()
            status = "OK" if ok else "FAILED"
            print(f"[TQ] pid={pid}: Hooks applied to TritonAttentionImpl ({status})",
                  file=sys.stderr)

Phase 2 fires when vllm.v1.kv_cache_interface loads. Engine capacity patches — telling vLLM's memory allocator about the compressed block sizes so it allocates the right number of blocks for 68-byte tokens instead of 256-byte tokens.

Phase 3 fires when vllm.v1.worker.gpu_model_runner loads. A safety net for the reshape patch, catching import ordering edge cases where Phase 2 ran before the model runner was available.

Phase 4 fires on the same module as Phase 3, but only when DFLASH_ENABLED=1. This is where the two systems meet:

if (not _dflash_applied
        and os.environ.get("DFLASH_ENABLED", "0") == "1"
        and name == "vllm.v1.worker.gpu_model_runner"):
    _dflash_applied = True
    try:
        from lib.gpu.dflash.vllm_hooks import apply_dflash_hooks
        ok = apply_dflash_hooks()
        pid = os.getpid()
        status = "OK" if ok else "STANDALONE"
        print(f"[DFlash] pid={pid}: Hooks {status} "
              f"(block={os.environ.get('DFLASH_BLOCK_SIZE', '16')}, "
              f"steps={os.environ.get('DFLASH_DIFFUSION_STEPS', '4')})",
              file=sys.stderr)

The _dflash_applied guard prevents double-patching. The AITHER_TQ_BITS=0 check at the top prevents the entire hook from activating in uncompressed mode. Two independent optimization stacks (sub-byte quantization and block diffusion speculative decoding) are installed by the same import hook, with DFlash reading the cache that TurboQuant compresses.

TQ compresses the KV cache. DFlash reads the compressed cache via extract_from_tq_cache(). Neither system was designed to work with the other. They cooperate because they operate at different abstraction levels (storage format vs. decode strategy) and communicate through the standard KV cache interface.

148 lines. Four phases. Two optimization stacks. Zero coordination beyond a shared import hook. This is what system-level optimization looks like in practice — not a grand unified theory, but independent components that compose because they respect each other's interfaces.

What It Costs

The entire GPU optimization stack — from sub-byte quantization kernels to graph-aware eviction to block diffusion speculative decoding — has an overhead you could lose in a rounding error.

Resource	Overhead
Draft model (DFlash)	87 MB VRAM (0.3% of 32 GB)
Graph advisor	~8 MB RAM, ~0.1% CPU
Tier bridge	<2% CPU (60s OODA loop)
Cache shadow (crash recovery)	~340 MB disk
Block metadata	264 bytes/block (200 node + 64 edge)
Total code	7,500 LOC across 37 files
Total tests	490 (all passing)

The entire stack's overhead — 87 MB of VRAM, 8 MB of RAM, less than 2% CPU — is smaller than one LoRA adapter. The 7,500 lines of code have zero external dependencies beyond PyTorch, Triton, and the vLLM process they all run inside. Everything ships as part of the AitherOS monorepo and deploys with a single docker compose up.

What's Next

The stack is complete. Now I optimize the stack.

Three directions:

Online DFlash recalibration. The draft model should continuously improve from production traffic. Every accepted/rejected token is a training signal. I am building a background micro-trainer that fine-tunes the draft model during idle GPU cycles — the same Dark Factory infrastructure I use for LoRA training. The goal is a draft model that adapts to the user's actual conversation patterns, not just the calibration corpus.

Attention-score-based eviction. The graph eviction advisor currently infers block importance from metadata, graph structure, and access patterns. The next version will instrument the Triton attention kernel to report actual attention scores — direct evidence of which blocks the model is actually reading. Metadata approximates importance. Attention scores are importance.

Adaptive tier boundaries. The current tier thresholds (50% and 80% DDR5 utilization) are static. The TierCacheBridge should learn workload-specific boundaries — recognizing that a coding session has different access patterns than a creative writing session, and adjusting promotion/demotion aggressiveness accordingly.

Each of these is an incremental improvement to an existing layer, not a new layer. The system is built. The compound effects are real. The question now is how much further the interactions can amplify each other.

The Series

TurboQuant: Sub-Byte KV Cache from Paper to Production — April 6, 2026
3-Tier KV Cache: When Your GPU Memory Becomes a Memory System — March 27, 2026
Graph Eviction Advisor: Context-Aware VRAM Management — April 3, 2026
Running Two vLLM Instances on a Single GPU — April 8, 2026
DFlash: Block Diffusion Speculative Decoding — 6x Faster Inference — April 13, 2026
Why Your AI Forgets What You Just Said — April 11, 2026

Enjoyed this post?

All posts Try AitherOS

Back to blog

architecturegpuinferencequantizationkv-cachevllmdeep-diveperformanceblackwellspeculative-decodingcapstone

The Full Stack: How Six Optimizations Turned One GPU into a Datacenter

April 13, 202622 min readDavid Parkhurst

The Full Stack: How Six Optimizations Turned One GPU into a Datacenter

April 13, 2026 · David Parkhurst

Aither forgot its own name on a Tuesday.

The Thesis

Over three months, I built six optimizations. Each one solved a specific bottleneck. Together, they solved a different problem entirely.

Here is the before and after. One table. The punchline on page one.

	Before	After
Addressable tokens	80K	4.7M
Models loaded	1	2
Single-stream decode	85 tok/s	350-510 tok/s
Aggregate throughput	200 tok/s	3,400+ tok/s
System prompt budget	6K tokens	25K+ tokens
Follow-up handling	Broken	3-turn carry-forward
Persona drift	At 80% VRAM	Eliminated
Quality (GSM8K)	84.5%	84.5% (lossless)

That is the story. Not six optimizations. One system.

The Six Layers in 60 Seconds

If you have read the individual posts, skip this section. If you have not, this is orientation — one paragraph per layer, enough to follow the rest of the post.

Tracing a Request Through the Full Stack

Let me show you what happens when you type a seven-word follow-up question into a system running all six layers.

"How do I get tickets?"

Seven words. No explicit mention of the Olympics. No URL. Just a pronoun reference to the previous conversation. Here is the full journey through all six layers.

Step 1: The Context Pipeline Catches the Follow-Up

The question is correctly classified as a follow-up to a web research conversation. Effort 5. Full pipeline. Tools enabled.

Step 2: Cache-Aware Context Assembly

Step 3: System Message Assembly

build_system_message() assembles the full prompt. 10+ layers, 25K+ tokens:

[AXIOMS] — core behavioral constraints (importance 100)
Identity — "I am Aither, an AI agent operating system..."
[RULES] — operational boundaries
[CAPABILITIES] — available tools and their descriptions
[CONTEXT] — current system state, active services
[MEMORIES] — relevant episodic memories from the graph
[AFFECT] — emotional/persona configuration
[WILL] — current goals and intentions
[KNOWLEDGE] — neuron search results (8,000 token budget, not truncated)
[WEB SEARCH] — prior search results and sources
[RECENT CONVERSATION] — full conversation history including the Olympics query

25,000+ tokens of context. Before the tiered cache, this would have been 6,000 tokens — aggressively compressed, top-3 truncated, last-4-turns only.
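As a rough illustration of fixed-order assembly, here is a minimal sketch. It is not the real build_system_message(): the function name, the bracketed IDENTITY tag, and the word-count token estimate are all assumptions made for the example. The idea that stable layers come first so the leading tokens stay identical between calls is also an assumption, inferred from the stable-prefix discussion later in the post.

```python
from typing import Dict, List, Tuple

# Hypothetical layer order, copied from the list above. Stable layers come
# first so the leading tokens stay identical from call to call (assumption).
LAYER_ORDER: List[str] = [
    "AXIOMS", "IDENTITY", "RULES", "CAPABILITIES", "CONTEXT", "MEMORIES",
    "AFFECT", "WILL", "KNOWLEDGE", "WEB SEARCH", "RECENT CONVERSATION",
]


def rough_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def assemble_prompt_sketch(layers: Dict[str, str]) -> Tuple[str, int]:
    """Join the non-empty layers in fixed order; return (prompt, rough tokens)."""
    sections = []
    for name in LAYER_ORDER:
        body = layers.get(name, "").strip()
        if body:  # empty or missing layers are skipped entirely
            sections.append(f"[{name}]\n{body}")
    prompt = "\n\n".join(sections)
    return prompt, rough_token_count(prompt)
```

Because the order comes from LAYER_ORDER rather than from the input dictionary, callers can supply layers in any order and still get the same prefix.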

Step 4: LLM Gateway → vLLM

The assembled prompt hits LLMGateway, which routes to AitherLLMQueue (in-process, no HTTP hop), which dispatches to the vLLM orchestrator instance.

Step 5: TurboQuant Compresses the KV Cache

The 3.8x savings are not just about fitting more tokens. They are the margin that makes everything downstream possible.

Step 6: 3-Tier Cache Manages Blocks

Step 7: Graph Eviction Advisor Protects Identity

The KVCacheGraph runs its background traversal every 500ms. It examines the current block layout and computes eviction scores:

score = (age × 0.01) - (degree × 5.0) - (edge_weight_sum × 2.0)
      - (importance × 20.0) - (hit_count × 3.0)
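The formula above translates directly into code. The sketch below uses the weights as written; everything else is an assumption: the BlockStats container and pick_victim helper are hypothetical names, age is taken to be in seconds, importance is taken to be on the 0-1 scale, and the highest score is assumed to be the first eviction candidate (which matches the signs, since age raises the score and everything else lowers it).

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class BlockStats:
    """Hypothetical per-block inputs to the scoring formula."""
    age: float              # seconds since the block was created (assumed unit)
    degree: int             # number of graph edges touching the block
    edge_weight_sum: float  # total weight of those edges
    importance: float       # assumed to be on a 0-1 scale
    hit_count: int          # how many times the block has been attended


def eviction_score(b: BlockStats) -> float:
    """Apply the weights from the formula above."""
    return (
        b.age * 0.01
        - b.degree * 5.0
        - b.edge_weight_sum * 2.0
        - b.importance * 20.0
        - b.hit_count * 3.0
    )


def pick_victim(blocks: Dict[str, BlockStats]) -> str:
    """Return the key with the highest score (assumed to be evicted first)."""
    return max(blocks, key=lambda k: eviction_score(blocks[k]))


# Made-up numbers: an hour-old, well-connected axiom block versus a
# 30-second-old assistant reply with almost no connections.
blocks = {
    "axioms": BlockStats(age=3600, degree=6, edge_weight_sum=4.0,
                         importance=1.0, hit_count=40),
    "reply": BlockStats(age=30, degree=1, edge_weight_sum=0.5,
                        importance=0.15, hit_count=1),
}
victim = pick_victim(blocks)
```

With these illustrative inputs the axiom block scores about -142 and the reply about -11.7, so the reply is the eviction candidate even though the axiom block is far older.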

Step 8: Dual-Model Routing

Step 9: DFlash Drafts 16 Tokens at Once

Step 10: Response Arrives

Total time from keystroke to first token: ~45ms. Total time for a 200-token response: ~570ms. On one GPU.
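Those two latency figures imply a decode rate, which can be checked against the single-stream range in the table at the top of the post. This is back-of-envelope arithmetic only; it assumes the first token is one of the 200 and that the remaining tokens arrive at a steady rate.

```python
ttft_s = 0.045    # ~45 ms from keystroke to first token
total_s = 0.570   # ~570 ms for the full response
tokens = 200

# Time spent decoding after the first token has arrived.
decode_s = total_s - ttft_s

# Steady-state decode rate for the remaining tokens (assumption: the
# first token is counted inside the 200).
decode_rate = (tokens - 1) / decode_s

# End-to-end rate, including time to first token.
end_to_end_rate = tokens / total_s
```

Under those assumptions the decode rate comes out near 380 tok/s and the end-to-end rate near 351 tok/s, both consistent with the 350-510 tok/s single-stream figure.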

Where the Layers Multiply

Dependencies are boring. The interesting part is where the layers amplify each other in ways nobody designed.

DFlash Gets Faster When Context Gets Richer

The code that bridges them is DFlashKVExtractor.extract_from_tq_cache() — the method where DFlash reads TurboQuant-compressed cache entries for cross-attention:

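from typing import Any, Tuple

import torch


def extract_from_tq_cache(
    self,
    tq_cache: Any,
    block_table: torch.Tensor,
    context_len: int,
    layer_idx: int = -1,
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Extract K/V from TurboQuant compressed cache (decompress to FP16)."""
    # Negative indices count back from the last layer, as with Python sequences.
    if layer_idx < 0:
        layer_idx = tq_cache.num_layers + layer_idx

    # Decompress only the requested layer.
    k_cache, v_cache = tq_cache.decompress_layer(layer_idx)

    # Reuse the paged-cache path to gather this sequence's blocks.
    return self.extract_from_paged_cache(
        k_cache, v_cache, block_table, context_len, layer_idx,
    )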

Prefix Pinning Creates a Budget Cascade

The stable prefix is ~2,600 tokens. Pin it once and it never recomputes. Over 100 LLM calls, that eliminates 260,000 tokens of redundant KV computation. But the savings do not stop at computation.

Neuron search results: 1,600 → 8,000 tokens (5x)
Memory retrieval: 1,100 → 6,000 tokens (5.5x)
Partner knowledge: 2,000 → 4,000 tokens (2x)
Conversation history: last 4 turns → full history
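The arithmetic behind these figures is easy to reproduce. The sketch below only restates numbers from this section; the variable names are illustrative, and the 260,000 figure follows the post's accounting of counting every one of the 100 calls.

```python
PREFIX_TOKENS = 2_600
CALLS = 100

# Prefix tokens that would otherwise be recomputed across all calls.
recompute_avoided = PREFIX_TOKENS * CALLS

# (before, after) token budgets from the list above.
budgets = {
    "neuron_search": (1_600, 8_000),
    "memory_retrieval": (1_100, 6_000),
    "partner_knowledge": (2_000, 4_000),
}

# Growth factor per budget, and the total number of extra tokens
# available to these three layers on every call.
growth = {name: after / before for name, (before, after) in budgets.items()}
extra_tokens = sum(after - before for before, after in budgets.values())
```

The memory retrieval ratio works out to about 5.45, which the list rounds to 5.5x, and the three budgets together gain 13,300 tokens per call.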

Persona Drift Required Three Layers to Fix

This is the bug from the opening paragraph, and it is the best example of why system-level thinking matters.

Fixing this required three layers:

import time
from dataclasses import dataclass, field
from typing import Tuple


@dataclass
class KVBlockNode:
    block_idx: int
    source_layer: str        # axioms, identity, memories, user_turn, etc.
    importance: float        # 0-1 from context budget scoring
    token_range: Tuple[int, int]
    tenant_slug: str
    created_at: float = field(default_factory=time.time)
    last_attended: float = field(default_factory=time.time)
    hit_count: int = 0
    is_spilled: bool = False
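To see why timestamps alone fail, compare plain LRU with a policy that looks at importance first. This is a minimal sketch, not the eviction logic used in AitherOS: Block is a trimmed, hypothetical stand-in for KVBlockNode, and the importance-first ordering is a simplification chosen only to show the contrast.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """Trimmed, hypothetical stand-in for KVBlockNode."""
    source_layer: str
    importance: float     # 0-1, as in the KVBlockNode comment
    last_attended: float  # seconds since session start


def lru_victim(blocks: List[Block]) -> Block:
    """Plain LRU: evict whichever block was touched longest ago."""
    return min(blocks, key=lambda b: b.last_attended)


def importance_aware_victim(blocks: List[Block]) -> Block:
    """Illustrative policy: least important first, LRU as the tie-breaker."""
    return min(blocks, key=lambda b: (b.importance, b.last_attended))


blocks = [
    # Written at session start and never touched again.
    Block("axioms", importance=1.0, last_attended=0.0),
    # Generated moments ago, but low value.
    Block("assistant_response", importance=0.15, last_attended=570.0),
]
```

LRU picks the axioms block because it has the oldest timestamp; the importance-aware ordering picks the assistant response instead.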

Two War Stories

I want to share two moments where everything was technically correct and nothing worked.

CUDA Graph Gibberish

Not high-perplexity text. Not slightly wrong answers. Gibberish. Random tokens. Corrupted output.

The fix took three versions after the naive v0.5 (v0.7, v0.8, and v1.0), and v1.3 then added split-k parallelism on top:

Version	What Changed	Graph Breaks/Token	Throughput
v0.5	Correct but naive	144	26 tok/s
v0.7	Eliminated data-dependent branches	36	87.9 tok/s
v0.8	Removed CPU-GPU synchronization stalls	0 (but 72 sync stalls)	224 tok/s
v1.0	Custom ops registered as CUDA-safe	0	382 tok/s
v1.3	Split-k parallelism across 170 SMs	0	589 tok/s
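The throughput column is easier to read as ratios. The short calculation below uses only the numbers in the table; the variable names are illustrative.

```python
# Throughput in tok/s, taken from the table above.
throughput = {"v0.5": 26.0, "v0.7": 87.9, "v0.8": 224.0, "v1.0": 382.0, "v1.3": 589.0}
versions = list(throughput)

# Speedup of each version over the one before it.
step_speedup = {
    cur: throughput[cur] / throughput[prev]
    for prev, cur in zip(versions, versions[1:])
}

# Speedup of the final version over the naive baseline.
total_speedup = throughput["v1.3"] / throughput["v0.5"]
```

Each step is worth roughly 3.4x, 2.5x, 1.7x, and 1.5x in turn, for about 22.7x overall between v0.5 and v1.3.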

A correct algorithm is necessary but nowhere near sufficient. I spent two weeks on the math and six weeks on making the math play nice with the GPU execution model.

The 92-Line Fix

The second war story is the opposite. No CUDA. No kernels. No papers. Five bugs across three files, and the entire GPU stack delivering wrong answers.

Five bugs. Three files. 92 lines of fix:

Specificity boost: specific patterns get +0.05 over broad patterns
Added web_research to the neuron vocabulary
Added web_research and research to _TOOL_INTENTS
Forwarded active_intent through the context dictionary
Safety net checks prior intent state for effort escalation

No compression ratios. No kernel optimizations. No VRAM accounting. Just application logic that drops the ball on an implicit reference to the previous turn.

The sitecustomize.py File

There is one file that tells the story of this entire stack better than any benchmark table.

The file starts by checking environment variables:

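import os

# Bit width for KV compression; the default of 0 leaves the hook inactive.
_TQ_BITS = int(os.environ.get("AITHER_TQ_BITS", "0"))
_TQ_MODE = os.environ.get("AITHER_TQ_MODE", "")
# A mode ending in "-primary" wins; otherwise fall back to AITHER_TQ_PRIMARY=1.
_TQ_PRIMARY = _TQ_MODE.endswith("-primary") if _TQ_MODE else (
    os.environ.get("AITHER_TQ_PRIMARY", "0") == "1"
)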

If AITHER_TQ_BITS is 2, 3, or 4, the hook activates. It replaces builtins.__import__ with _tq_import_hook, which watches for four specific module names and acts when they appear:

if not _backend_registered and name == "vllm.v1.attention.backends.triton_attn":
    _backend_registered = True
    try:
        if _TQ_PRIMARY:
            from lib.gpu.turboquant.vllm_hooks import apply_tq_hooks
            ok = apply_tq_hooks()
            pid = os.getpid()
            status = "OK" if ok else "FAILED"
            print(f"[TQ] pid={pid}: Hooks applied to TritonAttentionImpl ({status})",
                  file=sys.stderr)

Phase 3 fires when vllm.v1.worker.gpu_model_runner loads. A safety net for the reshape patch, catching import ordering edge cases where Phase 2 ran before the model runner was available.

Phase 4 fires on the same module as Phase 3, but only when DFLASH_ENABLED=1. This is where the two systems meet:

if (not _dflash_applied
        and os.environ.get("DFLASH_ENABLED", "0") == "1"
        and name == "vllm.v1.worker.gpu_model_runner"):
    _dflash_applied = True
    try:
        from lib.gpu.dflash.vllm_hooks import apply_dflash_hooks
        ok = apply_dflash_hooks()
        pid = os.getpid()
        status = "OK" if ok else "STANDALONE"
        print(f"[DFlash] pid={pid}: Hooks {status} "
              f"(block={os.environ.get('DFLASH_BLOCK_SIZE', '16')}, "
              f"steps={os.environ.get('DFLASH_DIFFUSION_STEPS', '4')})",
              file=sys.stderr)
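The general pattern in these excerpts, wrapping builtins.__import__ and acting once when a named module shows up, can be sketched in isolation. This is a simplified illustration, not the contents of sitecustomize.py: install_import_hook and its actions mapping are hypothetical names, the action is assumed to run after the real import completes, and only the requested module name is matched (relative imports are not resolved).

```python
import builtins
import sys
from typing import Callable, Dict


def install_import_hook(actions: Dict[str, Callable[[], None]]) -> Callable[[], None]:
    """Wrap builtins.__import__ so each action runs once, the first time its
    module name is imported. Returns a function that removes the hook."""
    original_import = builtins.__import__
    fired = set()

    def hook(name, globals=None, locals=None, fromlist=(), level=0):
        # Let the real import finish first so the module exists to be patched.
        module = original_import(name, globals, locals, fromlist, level)
        if name in actions and name not in fired:
            fired.add(name)  # mark first so a failing action is not retried
            try:
                actions[name]()
            except Exception as exc:
                # A broken patch should not take the import down with it.
                print(f"[hook] {name}: action failed ({exc})", file=sys.stderr)
        return module

    builtins.__import__ = hook

    def uninstall() -> None:
        builtins.__import__ = original_import

    return uninstall
```

A caller would pass a mapping such as {"some.module": patch_function} and keep the returned uninstall function for cleanup. The once-only guard plays the same role as the _backend_registered and _dflash_applied flags in the excerpts above.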

What It Costs

The entire GPU optimization stack — from sub-byte quantization kernels to graph-aware eviction to block diffusion speculative decoding — has an overhead you could lose in a rounding error.

Resource	Overhead
Draft model (DFlash)	87 MB VRAM (0.3% of 32 GB)
Graph advisor	~8 MB RAM, ~0.1% CPU
Tier bridge	<2% CPU (60s OODA loop)
Cache shadow (crash recovery)	~340 MB disk
Block metadata	264 bytes/block (200 node + 64 edge)
Total code	7,500 LOC across 37 files
Total tests	490 (all passing)

What's Next

The stack is complete. Now I optimize the stack.

Three directions:

The Series

TurboQuant: Sub-Byte KV Cache from Paper to Production — April 6, 2026
3-Tier KV Cache: When Your GPU Memory Becomes a Memory System — March 27, 2026
Graph Eviction Advisor: Context-Aware VRAM Management — April 3, 2026
Running Two vLLM Instances on a Single GPU — April 8, 2026
DFlash: Block Diffusion Speculative Decoding — 6x Faster Inference — April 13, 2026
Why Your AI Forgets What You Just Said — April 11, 2026
