TurboQuant: Building a Sub-Byte KV Cache Quantizer from Paper to Production
There is a constraint that dominates everything else when you serve LLMs on consumer hardware: KV cache memory. Not weights. Not activations. The KV cache. For a 32-layer, 8-head model at FP16, every token you process costs 128 KB of VRAM that stays allocated for the entire generation. At 32 GB total, with half reserved for model weights and overhead, you get maybe 80K tokens of context. That is the ceiling, and we hit it months ago.
FP8 quantization pushed it to 118K. Better, but still not enough when your agents run multi-turn conversations with full system context, tool histories, and cross-session memory retrieval. We needed more. Substantially more. And then a paper appeared on arXiv that claimed exactly what we were looking for.
The Paper
"TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate" by Zandieh et al. (arXiv:2504.19874, April 2025) describes an algorithm with a set of properties that, on first reading, sound too clean to be real:
- Data-oblivious: no calibration data needed. No representative dataset. No profiling runs.
- Online: works on streaming tokens. No batch preprocessing.
- Near-optimal: mean squared error within 2.7x of the information-theoretic lower bound.
- Sub-byte: operates at 2, 3, or 4 bits per value.
The algorithm itself is elegant. Given a vector from the KV cache:
1. Compute the L2 norm and normalize to the unit sphere
2. Apply a random orthogonal rotation (precomputed once, shared across all tokens)
3. Scalar-quantize each coordinate using a precomputed Lloyd-Max codebook
4. Pack the resulting indices into a compact bit representation
5. Store the packed indices plus the scalar norm
Dequantization reverses steps 4 through 1. The key insight is step 2: after normalization and a random rotation, the vector is distributed like a uniform point on the unit sphere, so each coordinate follows a Beta distribution that, for dimensions 64 and above, is well-approximated by a Gaussian N(0, 1/d). And for Gaussian-distributed scalars, the optimal quantizer is known exactly -- it is the Lloyd-Max quantizer, whose centroids and boundaries have been tabulated since the 1960s.
No learning. No calibration. Just geometry.
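The five steps are small enough to prototype directly. Here is a minimal NumPy sketch of the encode-decode roundtrip (no bit packing; the 4-bit centroids are the published Lloyd-Max values for N(0,1), and the function name is ours, not the package API):

```python
import numpy as np

# 4-bit Lloyd-Max centroids for N(0,1), from Max (1960) -- a sketch, not aither-kvcache.
CENTROIDS = np.array([
    -2.733266, -2.069016, -1.618002, -1.256233,
    -0.942391, -0.656804, -0.388089, -0.128350,
     0.128350,  0.388089,  0.656804,  0.942391,
     1.256233,  1.618002,  2.069016,  2.733266,
])

def encode_decode(x: np.ndarray, seed: int = 0) -> np.ndarray:
    """Roundtrip steps 1-3 of the pipeline on one vector (packing omitted)."""
    d = x.shape[-1]
    rng = np.random.default_rng(seed)
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    Q *= np.sign(np.diag(R))                 # step 2: shared random rotation
    norm = np.linalg.norm(x)                 # step 1: norm, then normalize
    y = Q @ (x / norm)
    c = CENTROIDS / np.sqrt(d)               # codebook scaled to N(0, 1/d)
    bounds = (c[:-1] + c[1:]) / 2            # Lloyd-Max decision boundaries
    idx = np.searchsorted(bounds, y)         # step 3: scalar quantize
    # steps 4-5 store idx + norm; decode runs the pipeline in reverse
    return norm * (Q.T @ c[idx])

x = np.random.default_rng(1).standard_normal(128)
x_hat = encode_decode(x)
cos = x @ x_hat / (np.linalg.norm(x) * np.linalg.norm(x_hat))
```

At 4 bits the reconstruction keeps cosine similarity around 0.995, consistent with the MSE figures reported later.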
Within 48 hours of the paper appearing, the community had already produced a C/CUDA llama.cpp fork, a PyTorch implementation running on an RTX 3060, and a Triton kernel. We decided to build our own from scratch, validate it against the mathematics, and wire it into our vLLM inference stack. The result is aither-kvcache -- an open-source Python package (CC-BY-4.0) that you can install with pip install aither-kvcache and plug into any vLLM deployment.
What We Built
The full implementation is published as aither-kvcache on PyPI. The core library is 10 Python files, approximately 2,200 lines of code. Here is how it breaks down.
The Codebook
The foundation is a set of hardcoded Lloyd-Max codebooks for the standard normal distribution at 1 through 4 bits:
_STANDARD_CODEBOOKS: Dict[int, Dict] = {
    2: {
        "centroids": np.array([-1.510469, -0.452781, 0.452781, 1.510469]),
        "mse": 0.117517,
    },
    4: {
        "centroids": np.array([
            -2.733266, -2.069016, -1.618002, -1.256233,
            -0.942391, -0.656804, -0.388089, -0.128350,
            0.128350, 0.388089, 0.656804, 0.942391,
            1.256233, 1.618002, 2.069016, 2.733266,
        ]),
        "mse": 0.009497,
    },
}
These are the optimal MSE scalar quantizers for N(0,1), from Max (1960) and Lloyd (1982). To adapt them to the actual coordinate distribution N(0, 1/d), we scale by 1/sqrt(d). We also implemented a full Lloyd-Max solver using scipy's numerical integration (compute_codebook_scipy) and validated that our hardcoded values match the solver output to within 1e-4. That solver runs the iterative conditional-mean algorithm on the exact Beta distribution for lower dimensions, and the Gaussian approximation for d >= 64.
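The scaling step and the derived decision boundaries are small enough to show inline (a sketch; the real codebook module also carries the per-width MSE constants, and `compute_codebook_scipy` is its solver, not shown here):

```python
import numpy as np

def scaled_codebook(centroids: np.ndarray, d: int):
    """Scale an N(0,1) Lloyd-Max codebook to N(0, 1/d) and derive its
    decision boundaries (midpoints between adjacent centroids)."""
    c = centroids / np.sqrt(d)
    bounds = (c[:-1] + c[1:]) / 2
    return c, bounds

# 2-bit codebook from the table above, scaled for head_dim=128
c2 = np.array([-1.510469, -0.452781, 0.452781, 1.510469])
c, bounds = scaled_codebook(c2, d=128)

# Sanity check: quantizing a centroid must return that centroid's own index
idx = np.searchsorted(bounds, c)
```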
The Rotation
The rotation module provides two methods for generating the random orthogonal matrix:
def random_orthogonal(d: int, seed: int = 42, device: str = "cuda",
                      dtype: torch.dtype = torch.float32) -> torch.Tensor:
    gen = torch.Generator(device="cpu").manual_seed(seed)
    G = torch.randn(d, d, generator=gen, dtype=dtype)
    Q, R = torch.linalg.qr(G)
    diag_sign = torch.sign(torch.diag(R))
    diag_sign[diag_sign == 0] = 1.0
    Q = Q * diag_sign.unsqueeze(0)
    return Q.to(device=device)
This is the standard Mezzadri (2007) approach: QR decomposition of a random Gaussian matrix yields a matrix uniformly distributed over the orthogonal group under Haar measure. The sign correction on the diagonal makes the factorization unique (R with a positive diagonal), which is what guarantees Haar uniformity rather than a distribution skewed by the QR routine's sign conventions.
We also provide a Randomized Hadamard Transform (RHT) option: H * D_k * H * D_{k-1} * ... * H * D_1, where H is the normalized Walsh-Hadamard matrix and each D_i is a random diagonal sign matrix. With 3 or more rounds, this closely approximates a Haar-random orthogonal matrix. The RHT applies in O(d log d) via butterfly operations versus O(d^2) for a full matrix multiply, though for head_dim=128 the difference is negligible.
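A minimal sketch of one RHT pipeline (our own `fwht` butterfly, not the package's implementation): each round is H applied to a random sign flip, and because the normalized Sylvester-ordered Hadamard matrix is symmetric and its own inverse, the whole transform inverts round by round.

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Normalized fast Walsh-Hadamard transform via O(d log d) butterflies.
    Requires len(x) to be a power of two."""
    y = x.copy()
    d = len(y)
    h = 1
    while h < d:
        y = y.reshape(-1, 2, h)
        a = y[:, 0, :].copy()
        y[:, 0, :] = a + y[:, 1, :]
        y[:, 1, :] = a - y[:, 1, :]
        y = y.reshape(d)
        h *= 2
    return y / np.sqrt(d)

def rht(x: np.ndarray, signs: list) -> np.ndarray:
    """Apply H * D_k * ... * H * D_1: one fwht per random sign vector."""
    for s in signs:
        x = fwht(s * x)
    return x

rng = np.random.default_rng(0)
signs = [rng.choice([-1.0, 1.0], size=128) for _ in range(3)]
x = rng.standard_normal(128)
y = rht(x, signs)

# Inverse: normalized H is involutory, so each round undoes as D * H
x_back = y
for s in reversed(signs):
    x_back = s * fwht(x_back)
```

The roundtrip is exact up to floating-point error, and the transform preserves norms, as any orthogonal rotation must.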
Bit Packing
Packing is mechanical but must be correct. Each bit-width has a different packing layout:
- 4-bit: pair values into nibbles, 2 values per byte. head_dim=128 produces 64 bytes.
- 3-bit: group 8 values into 24 bits (3 bytes). head_dim=128 produces 48 bytes.
- 2-bit: group 4 values into 1 byte. head_dim=128 produces 32 bytes.
The 3-bit case is the least obvious. We pack 8 three-bit indices into a 24-bit integer, then split it into 3 bytes:
def pack_3bit(indices: torch.Tensor) -> torch.Tensor:
    D = indices.shape[-1]  # must be a multiple of 8
    groups = D // 8
    idx = indices.to(torch.int32).reshape(*indices.shape[:-1], groups, 8)
    bits24 = torch.zeros(*idx.shape[:-1], dtype=torch.int32, device=indices.device)
    for i in range(8):
        bits24 = bits24 | ((idx[..., i] & 0x7) << (21 - i * 3))
    byte0 = ((bits24 >> 16) & 0xFF).to(torch.uint8)
    byte1 = ((bits24 >> 8) & 0xFF).to(torch.uint8)
    byte2 = (bits24 & 0xFF).to(torch.uint8)
    packed = torch.stack([byte0, byte1, byte2], dim=-1)
    return packed.reshape(*indices.shape[:-1], groups * 3)
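For clarity, here is the same bit layout together with its inverse, flattened to one dimension in NumPy (hypothetical `*_np` helpers mirroring the torch code, not the package API):

```python
import numpy as np

def pack_3bit_np(indices: np.ndarray) -> np.ndarray:
    """Pack groups of 8 three-bit indices into 24-bit words, split into 3 bytes."""
    idx = indices.reshape(-1, 8).astype(np.uint32)
    bits24 = np.zeros(idx.shape[0], dtype=np.uint32)
    for i in range(8):
        bits24 |= (idx[:, i] & 0x7) << (21 - i * 3)
    out = np.stack([(bits24 >> 16) & 0xFF, (bits24 >> 8) & 0xFF, bits24 & 0xFF],
                   axis=-1)
    return out.astype(np.uint8).reshape(-1)

def unpack_3bit_np(packed: np.ndarray) -> np.ndarray:
    """Inverse: rebuild each 24-bit word, then shift the 3-bit fields back out."""
    b = packed.reshape(-1, 3).astype(np.uint32)
    bits24 = (b[:, 0] << 16) | (b[:, 1] << 8) | b[:, 2]
    cols = [(bits24 >> (21 - i * 3)) & 0x7 for i in range(8)]
    return np.stack(cols, axis=-1).reshape(-1).astype(np.uint8)

idx = np.random.default_rng(0).integers(0, 8, size=128, dtype=np.uint8)
packed = pack_3bit_np(idx)   # 128 three-bit indices -> 48 bytes
```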
All packing operations handle arbitrary batch dimensions. This matters because KV cache tensors come in shapes like [batch, seq, heads, dim], and we need to pack along the last axis without destroying the batch structure.
Triton Kernels
For the GPU path, we wrote fused Triton kernels that combine quantization and packing into a single pass, avoiding intermediate tensor allocation. Here is the 4-bit quantize-and-pack kernel:
@triton.jit
def _quantize_pack_4bit_kernel(
    y_ptr, boundaries_ptr, packed_ptr, N,
    D: tl.constexpr, HALF_D: tl.constexpr,
):
    pid = tl.program_id(0)
    if pid >= N:
        return
    pair_offs = tl.arange(0, HALF_D)
    base = pid * D
    y_even = tl.load(y_ptr + base + pair_offs * 2)
    y_odd = tl.load(y_ptr + base + pair_offs * 2 + 1)
    # Vectorized searchsorted: count boundaries exceeded
    idx_even = tl.zeros([HALF_D], dtype=tl.int32)
    idx_odd = tl.zeros([HALF_D], dtype=tl.int32)
    for b in tl.static_range(15):
        boundary = tl.load(boundaries_ptr + b)
        idx_even += (y_even >= boundary).to(tl.int32)
        idx_odd += (y_odd >= boundary).to(tl.int32)
    packed = ((idx_even & 0xF) << 4) | (idx_odd & 0xF)
    tl.store(packed_ptr + pid * HALF_D + pair_offs, packed.to(tl.uint8))
One program per vector. It loads the even and odd coordinates, runs the searchsorted as a linear scan over 15 boundaries (faster than a binary search at this width), and packs both nibbles into one byte. The dequantize kernel does the reverse: unpack nibbles, gather from the codebook, store interleaved.
We have fused kernels for 2-bit and 4-bit. The 3-bit Triton kernel is a stub that falls back to the PyTorch path. Bit-packing 8 three-bit values into 3 bytes inside Triton's programming model is doable but complex, and the 2-bit and 4-bit widths cover the practical sweet spots.
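The unpack-and-gather direction is simple enough to mirror outside Triton. A NumPy sketch of the 4-bit layout (even coordinate in the high nibble, odd in the low, matching the kernel above; `dequantize_4bit_np` is a hypothetical name, not the package's PyTorch path):

```python
import numpy as np

def dequantize_4bit_np(packed: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Unpack nibbles, gather centroids, store interleaved (even/odd)."""
    idx_even = (packed >> 4) & 0xF        # high nibble: even coordinate
    idx_odd = packed & 0xF                # low nibble: odd coordinate
    out = np.empty(packed.shape[0] * 2, dtype=centroids.dtype)
    out[0::2] = centroids[idx_even]
    out[1::2] = centroids[idx_odd]
    return out

centroids = np.arange(16, dtype=np.float32)      # stand-in codebook
packed = np.array([0x12, 0xAB], dtype=np.uint8)  # pairs (1,2) and (10,11)
vals = dequantize_4bit_np(packed, centroids)     # -> [1., 2., 10., 11.]
```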
The Main Quantizer
The TurboQuant class ties everything together:
tq = TurboQuant(head_dim=128, bits=4, device="cuda")
packed, norms = tq.encode(kv_vectors) # [..., 128] -> [..., 64] uint8 + [...] f32
decoded = tq.decode(packed, norms) # [..., 64] uint8 + [...] f32 -> [..., 128] f16
It automatically selects the Triton path when CUDA is available and falls back to pure PyTorch otherwise. It carries the rotation matrix, codebook, and packing dispatch as instance state, so encoding and decoding are stateless operations on the input tensors.
Validation: Does the Math Hold?
This is where we stop trusting the paper and start trusting measurements. We ran encode-decode roundtrips on 10,000 random unit vectors at each bit-width and compared our measured MSE against the paper's theoretical bounds.
The theoretical framework: for b-bit quantization of vectors on S^{d-1}, the MSE lower bound is 1/4^b (information-theoretic), and the paper proves an upper bound of (3*pi/2)/4^b. The ratio between achieved MSE and the lower bound should be at most 2.7x.
| Bits | Measured MSE | Theory Lower | Theory Upper | Ratio to Lower | Paper Claim |
|---|---|---|---|---|---|
| 4-bit | 0.0093 | 0.0039 | 0.0184 | 2.39x | at most 2.7x |
| 3-bit | 0.0340 | 0.0156 | 0.0736 | 2.17x | at most 2.7x |
| 2-bit | 0.1161 | 0.0625 | 0.2945 | 1.86x | at most 2.7x |
Every bit-width falls comfortably within the claimed bounds. The ratios actually improve at lower bit counts, which aligns with theory: the Lloyd-Max quantizer is closer to optimal when there are fewer levels.
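The measurement is easy to reproduce at small scale. A sketch for the 2-bit case, using the codebook values from earlier (our test harness runs the same check across all widths with larger samples):

```python
import numpy as np

d, n = 128, 2000
rng = np.random.default_rng(0)
Q, R = np.linalg.qr(rng.standard_normal((d, d)))
Q *= np.sign(np.diag(R))                           # shared random rotation

# 2-bit Lloyd-Max codebook for N(0,1), scaled to unit-sphere coordinates
c = np.array([-1.510469, -0.452781, 0.452781, 1.510469]) / np.sqrt(d)
bounds = (c[:-1] + c[1:]) / 2

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # unit vectors on S^{d-1}
Y = X @ Q.T                                        # rotate
X_hat = c[np.searchsorted(bounds, Y)] @ Q          # quantize, inverse-rotate
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))

lower, upper = 1 / 4**2, (3 * np.pi / 2) / 4**2    # b = 2 bits
```

The measured MSE lands around 0.116, squarely between the two theoretical bounds and matching the 2-bit row of the table.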
Compression ratios versus FP16 (each 128-dim vector is 256 bytes at FP16):
| Bits | Packed Bytes | + Norm (f32) | Total | Compression |
|---|---|---|---|---|
| 4-bit | 64 | 4 | 68 | 3.8x |
| 3-bit | 48 | 4 | 52 | 4.9x |
| 2-bit | 32 | 4 | 36 | 7.1x |
The norm overhead (4 bytes per vector for a float32 scalar) is the tax you pay for data-obliviousness. A calibration-based quantizer could absorb the norm into the codebook, but then you need calibration data and lose the online property.
The Test Suite
92 tests in total: 87 pass on our Windows development host (CPU path), and the 5 GPU-only tests skip cleanly. The tests cover:
- Codebook correctness: centroids are sorted, symmetric, correctly scaled by dimension
- Packing roundtrips: pack-then-unpack recovers exact indices for all three bit-widths
- Rotation orthogonality: Q^T * Q = I to within floating-point tolerance
- MSE bounds: measured distortion is within theoretical range
- Compression ratios: byte counts match expectations
- KV cache store/decompress roundtrips
- Multi-layer cache operations
- Block copy (for beam search / copy-on-write)
- Memory estimation accuracy
The KV Cache Manager
With the core quantizer validated, we built a full KV cache manager targeting vLLM's paged attention interface:
cache = TurboQuantKVCache(
    num_blocks=2500, block_size=16,
    num_kv_heads=8, head_dim=128,
    num_layers=32, bits=4,
)
# Per-layer in model forward:
cache.store(layer_idx, keys, values, slot_mapping)
key_cache, value_cache = cache.decompress_layer(layer_idx)
# Feed key_cache, value_cache to standard paged attention
The memory layout is split into two regions:
- Permanent compressed storage: all layers, all blocks, stored as packed uint8 indices plus float32 norms
- Reusable decompression buffer: one layer's worth of FP16 KV data, shared across all layers
The buffer is the trick that makes this practical. You only need one layer decompressed at a time during the forward pass, so you allocate a single buffer and reuse it as you walk through the 32 layers. The buffer cost is amortized across all tokens and is small relative to the compressed storage.
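As a sanity check, here is the arithmetic for the two regions under the constructor's example config (`tq_cache_bytes` is a hypothetical helper, assuming float32 norms and an FP16 buffer, as described above):

```python
def tq_cache_bytes(num_layers: int, num_blocks: int, block_size: int,
                   num_kv_heads: int, head_dim: int, bits: int):
    """Rough sizes of the two memory regions: all-layer compressed storage
    and the single-layer FP16 decompression buffer."""
    packed_dim = head_dim * bits // 8              # packed bytes per vector
    per_vec = packed_dim + 4                       # plus one float32 norm
    vecs = num_blocks * block_size * num_kv_heads  # vectors per layer (per K or V)
    compressed = num_layers * vecs * per_vec * 2   # K and V, every layer
    buffer = vecs * head_dim * 2 * 2               # one layer, FP16, K and V
    return compressed, buffer

compressed, buffer = tq_cache_bytes(32, 2500, 16, 8, 128, 4)
```

With this config the compressed store is about 1.39 GB while the shared buffer is about 164 MB, which is why reusing one buffer across all 32 layers is cheap.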
The cache manager also provides selective decompression (decompress_blocks) for partially-filled caches, and block copy operations for beam search and copy-on-write -- both operating directly on the compressed representation without decompressing first.
Wiring It Into vLLM: The Hard Part
Our production setup: Nemotron-Orchestrator-8B on an RTX 5090 (32 GB VRAM), vLLM v0.15.1 with FlashInfer attention backend, gpu_memory_utilization=0.45, FP8 KV cache. This gives us 118,112 tokens of KV cache capacity.
The goal: make vLLM allocate blocks based on TQ-compressed page sizes instead of FP8 sizes, so it creates more blocks for the same VRAM budget.
vLLM v0.15 uses a multi-process architecture: the API server runs in pid 1, and the EngineCore workers run in separate processes (pid 115, 119, etc.). Both processes import vLLM's KV cache specification classes, and both need to be patched.
We solved this with a sitecustomize.py import hook:
# tq_sitecustomize.py -- placed on PYTHONPATH
# Patches vLLM BEFORE any module loads, in every process
import importlib
import sys
class TQImportFinder:
    def find_module(self, name, path=None):
        if name == "vllm.v1.kv_cache_interface":
            return self
        return None

    def load_module(self, name):
        # Load the real module first
        real_module = importlib.import_module(name)
        # Then patch it
        apply_tq_patches(real_module)
        return real_module

sys.meta_path.insert(0, TQImportFinder())
The hook intercepts vLLM's KV cache interface module at import time and patches real_page_size_bytes on both AttentionSpec and FullAttentionSpec. Both must be patched -- FullAttentionSpec shadows the parent class's property, so patching only AttentionSpec has no effect.
The patched property computes the TQ-compressed page size:
@property
def _tq_real_page_size_bytes(self):
    from lib.gpu.turboquant.packing import packed_size
    pd = packed_size(self.head_size, _TQ_BITS)
    return 2 * self.block_size * self.num_kv_heads * (pd + 4)
For Nemotron-8B with TQ4 (block_size=16, heads=8, head_dim=128): the standard FP8 page size is 32,768 bytes per block. Our TQ4 page size is 17,408 bytes. That ratio -- 1.88x more blocks for the same memory -- is the payoff.
With the patch active, vLLM calculated 222,336 tokens of KV cache capacity. Up from 118,112. An 88% increase, from a property override.
The Reshape Crash
The 222K allocation crashed.
Here is exactly why. When vLLM allocates KV cache, it does this:
# Inside vLLM's cache allocation
raw_bytes = page_size_bytes * num_blocks
kv_cache_tensor = torch.zeros(raw_bytes, dtype=torch.int8, device="cuda")
So far, correct. Our patched page_size_bytes is smaller, so raw_bytes is smaller for the same number of blocks. The total allocation fits in VRAM.
But then vLLM reshapes that raw tensor into the structured format that FlashInfer's CUDA attention kernel expects:
# vLLM's tensor reshape
kv_cache = raw_tensor.view(fp8_dtype).view([2, num_blocks, block_size, num_heads, head_dim])
This reshape expects 2 * num_blocks * block_size * num_heads * head_dim * sizeof(fp8) bytes. The standard shape. Our patched page size made the raw tensor smaller -- correctly sized for TQ storage -- but the reshape still expects the standard FP8 dimensions. The smaller tensor cannot fill the larger shape. The reshape fails with a size mismatch.
The fundamental issue: we told vLLM's allocator to reserve less memory per block, but we did not change the tensor format that the attention kernel reads from. vLLM's allocation path and its consumption path assume the same format, and we only patched the allocation side.
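The failure mode is easy to reproduce in miniature (a NumPy stand-in for the torch view, with 4 blocks instead of thousands):

```python
import numpy as np

num_blocks, block_size, heads, head_dim = 4, 16, 8, 128
fp8_page = 2 * block_size * heads * head_dim    # 32,768 bytes per block
tq4_page = 2 * block_size * heads * (64 + 4)    # 17,408 bytes per block

# Allocator side: reserves the (smaller) TQ page size per block
raw = np.zeros(tq4_page * num_blocks, dtype=np.int8)

# Consumer side: still reshapes to the standard FP8 layout
failed = False
try:
    raw.reshape(2, num_blocks, block_size, heads, head_dim)
except ValueError:
    failed = True   # the smaller buffer cannot fill the larger shape
```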
The Missing Piece
FlashInfer's paged attention kernel reads from [2, num_blocks, block_size, num_heads, head_dim] tensors in FP8 or FP16. It does not understand packed uint8 indices. To serve attention from TQ-compressed storage, we need to close the gap between our compressed format and the kernel's expected input.
There are three paths forward:
Option A: Fused Triton attention kernel. Write a custom attention kernel that reads directly from TQ-compressed storage. The kernel would: (1) unpack indices, (2) look up centroids, (3) apply inverse rotation, (4) rescale by norm, (5) compute attention scores -- all in a single fused pass. We already have the Triton dequantize kernels in triton_ops.py. Fusing them with attention score computation is the remaining work. This is the maximum-efficiency path: zero decompression buffer, zero memory overhead beyond the compressed storage.
Option B: Decompression buffer. Keep FlashInfer's standard attention kernel. Before each layer's attention computation, decompress the TQ cache into a standard-format buffer, run attention, then discard the buffer. This is what our TurboQuantKVCache.decompress_layer() already does. The problem: the buffer must be large enough to hold one full layer of decompressed KV data, and it scales with num_blocks, partially eating the compression savings.
Option C: Upstream contribution. Add TQ as a native kv_cache_dtype in vLLM, alongside FP8 and FP16. This would integrate at the allocation, storage, and attention levels simultaneously. It is the cleanest solution but requires coordination with the vLLM project.
We chose Option A.
The Fused Kernel: Rotated-Domain Attention
The breakthrough insight: you don't need to dequantize the keys at all. Instead of inverse-rotating each K vector, rotate the query forward once and compute dot products in the rotated domain:
score_i = ||k_i|| * (Pi @ q)^T @ y_hat_i / sqrt(d)
This is mathematically equivalent to q^T @ k_i / sqrt(d) but avoids the per-vector 128x128 matmul. The query rotation is a one-time cost. The per-token work is just a codebook lookup and a dot product.
The same trick works for value accumulation:
output = Pi^T @ (sum_i w_i * ||v_i|| * y_hat_v_i)
Accumulate in the rotated domain, rotate back once at the end. Two 128x128 matmuls total, regardless of context length.
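Both identities rest only on Pi being orthogonal, so they can be checked numerically in a few lines (NumPy sketch; `Pi` here is a fresh random rotation, not the production one):

```python
import numpy as np

d = 128
rng = np.random.default_rng(0)
Pi, R = np.linalg.qr(rng.standard_normal((d, d)))
Pi = Pi * np.sign(np.diag(R))                 # random orthogonal rotation

q = rng.standard_normal(d)
k = rng.standard_normal(d)

# Scores: rotating both sides leaves the dot product unchanged, so attention
# can run entirely in the rotated domain against cached (norm, y_hat) pairs.
k_norm = np.linalg.norm(k)
y_hat = Pi @ (k / k_norm)                     # what the cache stores (pre-quantization)
score_rotated = k_norm * (Pi @ q) @ y_hat / np.sqrt(d)
score_direct = q @ k / np.sqrt(d)

# Values: accumulate norm-scaled rotated vectors, rotate back once at the end.
w = rng.random(5)                             # attention weights
V = rng.standard_normal((5, d))
acc = sum(w[i] * np.linalg.norm(V[i]) * (Pi @ (V[i] / np.linalg.norm(V[i])))
          for i in range(5))
out_rotated = Pi.T @ acc
out_direct = (w[:, None] * V).sum(axis=0)
```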
We built a PyTorch reference implementation (TQPagedAttentionRef in fused_attention.py) and validated it with 16 tests:
- 5 math equivalence tests: rotated dot product, norm-scaled scoring, even/odd split, V accumulation, online softmax -- all pass to 1e-5 tolerance
- 6 functional tests: single block, multi-block, partial last block, zero context, shape validation, fused-vs-decompress comparison -- all pass with >0.99 cosine similarity
- 5 numeric tests: large norms, single token, 2-bit/3-bit/4-bit bit-widths -- all pass
The fused kernel reads directly from packed[block_size, num_kv_heads, packed_dim] uint8 tensors and norms[block_size, num_kv_heads] float32 tensors. No decompression buffer.
The Architectural Insight
The original Triton kernel stub tried to do everything inside the kernel: rotation, even/odd splitting, codebook gather, attention, inverse rotation. It hit Triton's register limitations: you cannot index a register tensor by a computed value (q_rot[i]), and even/odd slicing (q_rot[0::2]) is unreliable across Triton versions.
The fix is architectural, not algorithmic. Move rotation and interleaving outside the kernel into PyTorch, where they are trivial:
Pre-processing (PyTorch):
q_rot = query @ rotation.T # one matmul per batch
q_even = q_rot[..., 0::2] # stride slice
q_odd = q_rot[..., 1::2]
Kernel (Triton):
For each (seq, head): online softmax over compressed KV tokens
- Load [HALF_D] packed uint8
- Codebook gather: tl.load(centroids + nibble_indices)
- Score: k_norm * (dot(q_even, k_even) + dot(q_odd, k_odd)) * scale
- Online softmax + V accumulation in separate even/odd registers
Post-processing (PyTorch):
out_rot[..., 0::2] = acc_even
out_rot[..., 1::2] = acc_odd
output = out_rot @ rotation # inverse rotation
Every register array inside the kernel is [HALF_D=64] shaped. No computed indexing, no stride slicing, no in-kernel matmul. The codebook gather (tl.load(centroids_ptr + indices)) is native Triton. The even/odd split that blocked the original implementation is now a zero-cost stride operation in PyTorch.
Split-K: From 5x Slower to Competitive
The initial fused kernel launched one Triton program per query head -- 32 programs on 170 SMs, 19% GPU utilization. Each program serially processed all KV blocks. At 16K context, that was 537ms for 36 layers.
Split-k fixes this by partitioning KV blocks across 8 parallel programs per head. Phase 1 computes partial softmax results per split; phase 2 merges them via log-sum-exp reduction. Grid goes from (32, 1) to (32, 1, 8) = 256 programs -- saturating the GPU.
Results: 27x faster at 16K context (19.6ms vs 537ms). Per-layer latency of 544us is competitive with FlashAttn for decode. Short contexts (< 2K tokens) use the single-program kernel to avoid split-k overhead.
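The phase-2 merge is the standard log-sum-exp combination of per-split partials. A NumPy sketch of the reduction (8 splits over a 1,024-token context; the names are illustrative, not the kernel's):

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.standard_normal(1024)            # one head's attention scores
values = rng.standard_normal((1024, 64))

# Reference: softmax-weighted sum over the whole context at once
p = np.exp(scores - scores.max())
ref = (p[:, None] * values).sum(axis=0) / p.sum()

# Phase 1: each split reports (local max, exp-sum, weighted partial sum)
partials = []
for chunk in np.split(np.arange(1024), 8):
    s, v = scores[chunk], values[chunk]
    m = s.max()
    e = np.exp(s - m)
    partials.append((m, e.sum(), (e[:, None] * v).sum(axis=0)))

# Phase 2: rescale every split to the global max, then combine
g = max(m for m, _, _ in partials)
denom = sum(np.exp(m - g) * z for m, z, _ in partials)
out = sum(np.exp(m - g) * acc for m, _, acc in partials) / denom
```

Rescaling by exp(m - g) brings every split's partial sums onto a common scale, so the merged result matches a single-pass softmax exactly.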
Full Model Simulation: Nemotron-8B
We simulated a complete 36-layer decode pass matching the orchestrator's exact configuration (32 query heads, 8 KV heads, GQA 4:1, head_dim 128):
| Context | 36-layer decode | Per-layer | TQ4 memory | FP8 memory | Saved |
|---|---|---|---|---|---|
| 256 | 6.7 ms | 186 us | 10 MB | 18 MB | 8 MB |
| 1K | 20 ms | 561 us | 38 MB | 72 MB | 34 MB |
| 2K | 8.5 ms | 235 us | 77 MB | 144 MB | 67 MB |
| 4K | 19.6 ms | 544 us | 153 MB | 288 MB | 135 MB |
| 8K | 19.6 ms | 544 us | 306 MB | 576 MB | 270 MB |
| 16K | 19.6 ms | 544 us | 612 MB | 1,152 MB | 540 MB |
The 2K+ rows use split-k parallelism (8-way block splitting across GPU SMs). The 8K and 16K numbers are identical because our test cache was capped at 256 blocks -- the per-layer latency (544 us) is the meaningful metric. It scales flat because split-k distributes the work.
At a 20 GB VRAM budget for KV cache:
| Format | Max tokens | Concurrent 40K sequences |
|---|---|---|
| FP8 | 291,271 | 7.1x |
| TQ4 | 548,275 | 13.4x |
| Gain | +257,004 | +6.3 extra sequences |
Validation on RTX 5090
We validated the fused kernel against the PyTorch reference implementation across context lengths from 1 to 4,096 tokens:
ctx= 1 cosine=1.000000 maxdiff=0.000001
ctx= 16 cosine=1.000000 maxdiff=0.000122
ctx= 256 cosine=1.000000 maxdiff=0.000000
ctx= 1024 cosine=1.000000 maxdiff=0.000031
ctx= 4096 cosine=1.000000 maxdiff=0.000004
Perfect cosine similarity at every length. The maximum absolute difference (0.000122) stays well within FP16 rounding error.
Wiring the Fused Kernel Into vLLM
With a validated fused kernel in hand, we turned to integrating it into the actual vLLM forward pass. This is where the real engineering begins.
The FlashInfer API Mismatch
When we deployed the fused kernel into the Docker container, we discovered that vLLM v0.15 changed the Attention.forward() signature from what we expected. The KV cache is no longer passed as a function argument -- it's stored as self.kv_cache on the module, and attention metadata comes from a context manager (get_forward_context().attn_metadata), not as a positional argument.
This means patching Attention.forward() requires adapting to the v0.15-specific API where:
- The cache write goes through `torch.ops.vllm.unified_attention` (a registered custom op)
- The attention metadata is implicit, not explicit
- The forward signature is `forward(self, query, key, value, output_shape=None)`
We adapted: instead of patching Attention.forward(), we hook GPUModelRunner.load_model and replace each layer's self.impl (the FlashInfer backend instance) with our _TQAttentionImplWrapper after model load. This matches the correct v0.15 call path: self.impl.forward(layer, query, key, value, kv_cache, attn_metadata, output=output).
The isinstance Problem
vLLM's internals check isinstance(impl, FlashInferImpl) in several places. Replacing the impl object with our wrapper class would fail those checks and crash the attention metadata builder.
The fix: wrap forward() in-place on the original FlashAttn/FlashInfer impl objects, rather than replacing them. This preserves class identity -- isinstance(impl, FlashInferImpl) passes because the object IS still a FlashInferImpl. It just has a monkey-patched forward().
In the latest deployment, all four monkey-patches fire successfully in the EngineCore worker:
- `page_size_bytes` patched (221,808 tokens calculated -- 1.88x over FP8)
- `_reshape_kv_cache_tensors` patched (36 raw TQ tensors returned, standard reshape skipped)
- 36 `Attention.impl` instances replaced with TQ wrappers
- `enforce_eager=True` active (prevents torch.compile from breaking patches)
The CUDA Device-Side Assert
The final hurdle in the initial integration was the TQ encode itself. Running the rotation matmul on CUDA inside the attention forward path triggered device-side asserts that poisoned the entire CUDA context -- killing not just TQ but the subsequent FlashAttn computation too. CUDA device-side asserts are fatal and uncatchable by Python try/except. Once one fires, every subsequent CUDA call in the process returns cudaErrorAssert until the process exits.
The fix: run TQ encode on CPU. Copy K,V to CPU, encode there, store the packed result. This avoids any CUDA context contamination. The CPU encode adds latency but proves the pipeline works end-to-end. With this change, the orchestrator boots, serves inference, and all 36 layers successfully shadow-encode every token:
[TQ] Layer 0: shadow encode OK (16 tokens -> packed torch.Size([128, 64]))
[TQ] Layer 1: shadow encode OK (16 tokens -> packed torch.Size([128, 64]))
...
[TQ] Layer 35: shadow encode OK (16 tokens -> packed torch.Size([128, 64]))
The complete pipeline -- paper math, codebook, rotation, quantize, pack, production forward pass, 36 layers encoding real inference tokens -- is proven end-to-end.
Two-Tier KV Cache: Proving the Architecture
With the encode pipeline proven via CPU fallback, we implemented a two-tier KV cache architecture as the stepping stone to the full fused path:
- Hot tier: Standard FP8 cache managed by vLLM (118K tokens, VRAM)
- Cold tier: TQ4-compressed CPU storage via `TieredKVCache` (unlimited capacity, system RAM)
Every token written to the FP8 cache is simultaneously TQ-compressed and stored in the cold tier. FlashAttn handles all attention computation on the hot tier. The cold tier is ready for block retrieval when needed: tiered.warm_blocks(layer_idx, block_indices, hot_cache) decompresses selected blocks from TQ storage back into the FP8 cache.
Request -> FlashAttn writes K,V to FP8 hot cache
-> FlashAttn computes attention output
-> TieredKVCache.compress_new() copies K,V to CPU, TQ-encodes, stores
-> (later) TieredKVCache.warm_blocks() decompresses back when needed
This architecture:
- Adds zero latency to the attention path (TQ encode runs AFTER FlashAttn completes)
- Protects CUDA context (CPU-side encode can't trigger device-side asserts)
- Preserves exact FlashAttn output (hot tier data is untouched by TQ)
- Enables future block eviction (cold tier already holds compressed copies of all blocks)
In production logs:
[TQ] L0: cold tier encode OK (16 tokens)
[TQ] L1: cold tier encode OK (16 tokens)
...
[TQ] L35: cold tier encode OK (16 tokens)
All 36 layers actively encoding to the cold tier during live inference. The orchestrator serves requests correctly with TQ enabled (AITHER_TQ_BITS=4).
While building the TQ cold tier, we realized the orchestrator's gpu_memory_utilization was set to 0.45 -- a holdover from when the reasoning model shared the local GPU. The reasoning model moved to cloud GPUs months ago. The orchestrator had 11 GB of VRAM sitting idle.
We bumped gpu_memory_utilization to 0.80:
BEFORE: 118,112 tokens (util=0.45, FP8, 2.88x concurrency)
AFTER: 280,144 tokens (util=0.80, FP8 + TQ cold tier, 6.84x concurrency)
280K tokens. 6.84 concurrent full-context requests. The TQ cold tier runs in parallel, async-compressing every token to CPU storage via a background thread with zero-sync GPU-to-CPU transfers.
3-Tier KV Cache: VRAM + DDR5 + Recompute
With the fused kernel validated and the two-tier architecture proven, we built the full 3-tier KV cache that exploits the 128 GB of DDR5 system RAM alongside the 32 GB RTX 5090.
Tier 1 -- VRAM (fused Triton decode): TQ4-compressed KV cache on GPU. The fused kernel reads it directly, no decompression. 548K tokens in 20 GB.
Tier 2 -- DDR5 (cold tier, bulk transfer): TQ4-compressed blocks in pinned system RAM. Same format as VRAM -- block warming is a raw memcpy, not a decompress. At roughly 38 KB per token (36 layers, K and V), 128 GB holds about 3.5 million tokens.
Tier 3 -- Recompute (cache miss): used only when both tiers miss (prefix never seen before).
Storage Layout: 5D Contiguous Tensors
The key to fast spill/warm is the storage layout. Both VRAM and DDR5 use contiguous 5D tensors:
# [num_layers, max_blocks, block_size, num_kv_heads, packed_dim] uint8
# Spill: 4 bulk GPU->CPU copies (all layers in one shot)
cold_k_packed[:, block_indices] = gpu_k_packed[:, block_indices].cpu()
# Warm: 4 bulk CPU->GPU copies (pinned memory, non_blocking)
gpu_k_packed[:, block_indices] = cold_k_packed[:, block_indices].to('cuda', non_blocking=True)
Four tensor copy operations total, regardless of layer count. No per-layer Python loop.
Cold Tier Bandwidth
Measured on RTX 5090 with Nemotron-8B (36 layers, 8 KV heads):
| Blocks | Tokens | Data | Warm time | Bandwidth |
|---|---|---|---|---|
| 64 | 1K | 38 MB | 284 ms | 135 MB/s |
| 256 | 4K | 153 MB | 480 ms | 319 MB/s |
| 1024 | 16K | 612 MB | 571 ms | 1,072 MB/s |
At scale (1024+ blocks), warm bandwidth reaches 1 GB/s over PCIe 4.0. The fused kernel then serves decode attention directly from the warmed TQ blocks -- no decompression step between DDR5 and attention.
The Production Flow
New token -> TQ encode on GPU -> store in VRAM Tier 1
-> async spill to DDR5 Tier 2 (background, non-blocking)
Normal decode -> fused Triton kernel reads VRAM TQ directly
Block eviction -> blocks already in DDR5 (spilled earlier)
-> VRAM space freed for new sequences
Prefix cache hit on evicted sequence -> warm from DDR5 (raw TQ memcpy)
-> fused decode immediately
-> no recompute, no decompress
vLLM Integration
The TurboQuantImpl attention backend registers via vLLM's plugin system and routes automatically:
- Decode (`max_query_len == 1`): Fused TQ kernel on VRAM-resident compressed data
- Prefill (`max_query_len > 1`): Standard Triton attention on FP8 vLLM cache
- Both paths: Encode new K,V to TQ GPU storage for future decode use
The backend is activated with --attention-backend CUSTOM (vLLM CLI flag) and AITHER_TQ_FUSED=1. The env var VLLM_ATTENTION_BACKEND is not read by vLLM v0.15.1's V1 engine -- only the CLI flag works. It falls through to standard Triton on any error, so there is no risk of degraded inference.
PRIMARY Mode: Killing the Decompression Bottleneck
SHADOW mode keeps the standard FP8 cache as the primary store and writes a compressed copy on the side. PRIMARY mode (AITHER_TQ_MODE=tq4-primary) goes further: the vLLM KV cache tensor IS the TQ cache. uint8 layout, 3.8x more blocks, 309K+ tokens addressable. But the original PRIMARY decode path had a problem.
The first implementation used batch decompression: gather all active TQ blocks, call tq.decode() to expand them into a temporary bf16 buffer, then call unified_attention on the decompressed data. This meant decompressing every active block on every decode step, for every layer. At 128 blocks and 36 layers, that is 576K vector decompressions per generated token. The result: 8-12 tok/s. The decompression was the bottleneck, not the attention.
The fix was to use the same fused TQPagedAttention kernel that SHADOW mode already had. The kernel reads packed uint8 indices and f32 norms directly from the primary cache, unpacks nibbles in-register during attention (codebook lookup + rotation in-kernel), and never materializes a decompression buffer. Zero intermediate memory. The key change in _forward_primary_decode():
# Before: decompress everything, then attend
self._batch_decompress_active_blocks(...)   # 576K decompressions/token
unified_attention(decompressed_buf, ...)

# After: fused kernel reads compressed data directly
self._ensure_fused_attn()                   # TQPagedAttention from SHADOW mode
fused_out = self._fused_attn.forward(
    query=query.to(torch.float32),
    k_packed=key_cache[:, :, :, :packed_dim],  # raw uint8
    k_norms=primary_k_norms[layer_idx],        # f32
    ...
)
The [:, :, :, :packed_dim] slice produces a non-contiguous view (stride mismatch from the 4-byte padding in tq_dim=68 vs packed_dim=64). The Triton kernel handles this via explicit stride-based pointer arithmetic -- no .contiguous() copy needed.
The old batch-decompress path is preserved as _forward_primary_decode_slow() -- a fallback that triggers only if the fused kernel fails to initialize. In production it never fires. Single-request throughput stays at ~12.8 tok/s (the model's MLP layers dominate at short context), but the fused path eliminates the decompression VRAM allocation entirely and scales better at long contexts where the old path's O(active_blocks * layers) decompress cost was the wall.
Production on Blackwell: 224 tok/s
This is the part where everything comes together.
The RTX 5090 is NVIDIA's Blackwell consumer GPU -- SM_120, compute capability 12.0. Triton kernels compiled for Ampere/Hopper (SM_80/SM_90) don't automatically work on Blackwell. The SM_100+ instruction set introduces new register file layouts and warp scheduling. Our initial guard conservatively disabled Triton on SM_100+ and fell back to the PyTorch reference implementation -- a pure Python nested loop that runs at 0.5 tok/s. Unusable.
We added AITHER_TQ_FORCE_TRITON=1 to override the Blackwell guard and test the Triton kernel directly on SM_120. Result: it works. Triton's compiler generates valid PTX for Blackwell without modification. The split-k attention kernel, the quantize/dequantize ops, the paged cache lookups -- all functional on SM_120.
With the fused kernel running on Blackwell and eager mode, we measured 26.1 tok/s at 5 concurrent requests. Good, but not the final number. The --enforce-eager flag was still active -- a leftover from early TQ development, when the custom attention backend couldn't survive CUDA graph capture.
The CUDA Graph Problem
CUDA graph capture calls forward() to record the computation graph, then replays the recorded graph for subsequent calls with the same input shapes. TQ's forward has dynamic control flow -- cache initialization, conditional encode, decode routing based on max_query_len. CUDA graphs cannot record branching. The capture fails with cudaErrorStreamCaptureInvalidated.
The fix is a single check at the top of forward():
```python
if torch.cuda.is_current_stream_capturing():
    return super().forward(...)  # pure delegation during graph recording
```
During graph capture, TQ is invisible -- vLLM records the standard Triton attention path. During live inference (outside graph replay), TQ runs its full encode + fused decode pipeline. The two modes coexist because CUDA graphs replay the captured path for batch sizes that were captured, and fall through to eager mode for sizes that weren't.
Three Bugs Found During Live Validation
The Blackwell deployment exposed three bugs that only manifested under production conditions:
1. Shadow cache gating: TQ cache initialization was gated behind _fused_enabled, preventing encode in shadow-only mode (FUSED=0). Shadow cache must encode regardless -- only the decode kernel is gated.
2. Environment passthrough: AITHER_TQ_FORCE_TRITON and AITHER_TQ_EAGER were in .env but not in the Docker Compose environment: section, so they never reached the container. Docker Compose only passes env vars explicitly listed in environment: or env_file:. Fix: both are now in the compose environment: block with defaults (AITHER_TQ_FORCE_TRITON=1, AITHER_TQ_EAGER=0).
3. OOM on cache allocation: TQ cache was allocated using 20% of free VRAM (up to 2.5 GiB), but vLLM's internal memory pool doesn't account for external CUDA allocations. Capped at 512 MiB with a hard try/except -- TQ cache allocation can never crash the EngineCore subprocess.
The Numbers
With AITHER_TQ_EAGER=0 enabling torch.compile and CUDA graphs:
| Mode | 5-concurrent tok/s | Improvement |
|---|---|---|
| --enforce-eager (before) | 26.1 | baseline |
| torch.compile + CUDA graphs | 64.0 | +145% |
| Graphable fused decode (v0.7.0) | 87.9 | +237% |
| v0.8.0 optimizations (rotation cache, buffer reuse, split-k 64/16) | 224.1 | +758% |
| Metric | Value |
|---|---|
| Architecture | RTX 5090 (SM_120, Blackwell) |
| Model | Nemotron-Orchestrator-8B-AWQ-4bit |
| Attention backend | CUSTOM (TurboQuantImpl) |
| Decode throughput (1 req, cold) | 13.5 tok/s |
| Decode throughput (1 req, hot cache) | 15.8 tok/s |
| Decode throughput (5 concurrent) | 224.1 tok/s |
| Decode throughput (10 concurrent) | 288.6 tok/s |
| Decode throughput (20 concurrent) | 115.2 tok/s (GPU saturated) |
| Max generation (cold, 4096 tok limit) | 2,666 tokens generated |
| Max generation (hot, 4096 tok limit) | 3,895 tokens generated |
| Hot cache speedup (single req) | 12-17% prefix cache savings |
| TQ cache | 36 layers x 856 blocks @ TQ4 |
| TQ VRAM overhead | 512 MB GPU + 512 MB DDR5 |
| Total VRAM | 22.9 / 32.6 GiB (9.3 GiB free) |
| Runtime errors | 0 (stress tested, 20x concurrent) |
The effective context budget with the fused kernel live and the 3-tier cache operational:
| Tier | Storage | Format | Capacity |
|---|---|---|---|
| Hot (VRAM) | 20 GiB KV budget | TQ4 | ~280K tokens |
| Cold (DDR5) | 128 GiB system RAM | TQ4 | ~3.9M tokens |
| Recompute | -- | -- | Unlimited (cache miss penalty) |
280K tokens hot in VRAM. 3.9 million tokens addressable from DDR5 with sub-millisecond block warming. 224 tokens per second of fused decode at 5 concurrent requests. On a single consumer GPU.
Hybrid TQ35: Better Quality, Same Throughput
The same day we validated Blackwell, we integrated a hybrid bit-width quantizer inspired by mitkox/vllm-turboquant. Instead of uniform 4-bit across all dimensions, TQ35 splits each head into two groups:
- Outlier group (50% of dims, high variance): 3-bit MSE codebook + 1-bit QJL residual
- Regular group (50% of dims, low variance): 2-bit MSE codebook + 1-bit QJL residual
The QJL residual is the key innovation: after MSE quantization, the quantization error is projected through a separate structured Hadamard transform and stored as 1-bit sign bits. This captures the direction of the error vector at almost zero cost -- one bit per dimension, using the Johnson-Lindenstrauss property that random projections preserve inner products in expectation.
The result: 3.5 average bits per value, same 64 bytes per vector as TQ4, but measurably better reconstruction quality because the bit budget is allocated where it matters most.
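A toy version of the sign-residual idea (an illustration, not the production code: a random orthogonal projection stands in for the structured Hadamard transform, and the 2-bit Lloyd-Max codebook for a unit Gaussian uses the standard tabulated centroids):

```python
import torch

torch.manual_seed(0)
d = 64
x = torch.randn(d)

# Coarse 2-bit MSE quantization (Lloyd-Max centroids for a unit Gaussian)
centroids = torch.tensor([-1.510, -0.4528, 0.4528, 1.510])
idx = torch.argmin((x[:, None] - centroids[None, :]).abs(), dim=1)
x_hat = centroids[idx]

# 1-bit residual: project the quantization error through a random orthogonal
# matrix (stand-in for the Hadamard transform) and keep only the signs.
r = x - x_hat
H = torch.linalg.qr(torch.randn(d, d)).Q      # random orthogonal matrix
z = H @ r
alpha = z.abs().mean()                         # optimal per-vector scale
r_hat = H.T @ (alpha * z.sign())               # decode: 1 bit/dim + 1 scalar

mse_plain = (x - x_hat).pow(2).mean()
mse_qjl = (x - (x_hat + r_hat)).pow(2).mean()
assert mse_qjl < mse_plain   # the sign bits recover part of the error energy
```

For Gaussian-like residuals the optimal scale is the mean absolute value of the projected error, which cuts the residual MSE to roughly (1 - 2/pi) of its original value, all for one extra bit per dimension plus one scalar.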
| Mode | Avg Bits | Bytes/Vector | vs FP16 | Quality |
|---|---|---|---|---|
| TQ4 (uniform) | 4.0 | 68 | 3.8x | Good |
| TQ35 (hybrid) | 3.5 | 64 | 4.0x | Better (QJL residual) |
| TQ25 (hybrid) | 2.5 | 44 | 5.8x | Acceptable |
In production, TQ35 allocated 910 blocks versus TQ4's 856 -- 6% more capacity in the same 512 MiB VRAM budget. Throughput is comparable to TQ4. Zero encode errors. Set AITHER_TQ_MODE=tq35 to enable.
Cloud Reasoning: TQ4 on a $0.12/hr RTX 3090
The local orchestrator was not the only model starved for context. Our deep reasoning model -- DeepSeek-R1-distill-Qwen-14B-AWQ running on a Vast.ai RTX 3090 -- was configured for 32,768 tokens of context. It was getting 8,192.
The 14B AWQ model occupies roughly 8 GB of VRAM. On a 24 GB RTX 3090 at 90% utilization, that leaves approximately 13.6 GB for KV cache. With FP16 (the default kv_cache_dtype), each token costs 2 (K+V) x 8 (KV heads) x 128 (head dim) x 2 bytes x 48 layers = 196,608 bytes. Divide 13.6 GB by 192 KB per token and you get around 72K tokens. Plenty of room. So why 8,192?
vLLM v0.17 auto-reduces max_model_len when the profiler determines that the requested context length cannot fit alongside the model, CUDA overhead, and activation memory. The profiler runs with the standard FP16 page size. At standard sizing, 32K tokens of KV cache demands 32768 * 196608 = 6.1 GB per concurrent sequence, and at 16 max sequences, that far exceeds the budget. vLLM silently capped to 8,192.
TQ4 changes the arithmetic. Each page drops from 65,536 bytes to 17,408 bytes -- 3.76x smaller. The profiler, seeing the TQ-patched page size, allocates 3.76x more blocks and accepts the full 32K context.
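The page arithmetic can be checked directly (a sketch; the 16-token block size is vLLM's default, and the TQ4 layout assumes 64 packed bytes plus a 4-byte f32 norm per vector):

```python
# Per-page KV bytes for one layer: 2 tensors (K and V) x KV heads x per-vector
# bytes x tokens per block.
kv_heads, head_dim, block_tokens = 8, 128, 16

fp16_page = 2 * kv_heads * head_dim * 2 * block_tokens         # 2 bytes/value
tq4_page = 2 * kv_heads * (head_dim // 2 + 4) * block_tokens   # 4-bit packed + norm

print(fp16_page)              # 65536
print(tq4_page)               # 17408
print(fp16_page / tq4_page)   # ~3.76x more blocks in the same budget
```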
Deployment
The reasoning model runs on Vast.ai -- a rented RTX 3090 accessed via SSH. To deploy TurboQuant:
1. Tar the lib/gpu/turboquant/ library and the tq-vllm-server.py launcher (72 KB total)
2. SCP the tarball to the instance
3. Kill vanilla vLLM and restart with the TQ wrapper
```shell
# On the Vast.ai instance:
export AITHER_TQ_BITS=4 AITHEROS_ROOT=/app PYTHONPATH=/app:/app/lib
python3 tq-vllm-server.py \
  --model casperhansen/deepseek-r1-distill-qwen-14b-awq \
  --served-model-name deepseek-r1:14b \
  --max-num-seqs 16 --max-model-len 32768 \
  --gpu-memory-utilization 0.90 --kv-cache-dtype fp8_e4m3 \
  --port 8000 --trust-remote-code
```
Startup logs:
```
TurboQuant 4-bit KV Cache Compression
arXiv:2504.19874 (Zandieh et al., 2025)
Applying TurboQuant 4-bit patches to vLLM v1...
TQ4 page: 17,408 bytes/block (vs 65,536 standard, 3.8x more blocks)
Patch page_size : OK
Patch max_memory : OK
TurboQuant 4-bit patches active
GPU KV cache size: 126,240 tokens
Maximum concurrency for 32,768 tokens per request: 3.85x
```
Results
| Metric | Before (FP16) | After (TQ4) |
|---|---|---|
| max_model_len | 8,192 | 32,768 |
| KV cache tokens | ~25K | 126,240 |
| Concurrent 32K requests | 0.77x | 3.85x |
| Page size per block | 65,536 B | 17,408 B |
The reasoning model now serves its full configured context. Chain-of-thought inference -- the <think> blocks that DeepSeek-R1 produces -- routinely generates 1K+ tokens of internal reasoning. At the old 8K limit, complex multi-step problems were being silently truncated. At 32K, the model has room to think.
Automated Cloud Deployment
We updated the cloud provisioning pipeline so future reasoning instances automatically get TurboQuant. The cloud_node_profiles.yaml reasoning profile now includes:
```yaml
turboquant:
  enabled: true
  bits: 4
  deploy_files:
    - lib/gpu/turboquant/
    - scripts/tq-vllm-server.py
```
When deploy_from_profile("reasoning") provisions a new Vast.ai instance, it SCPs the TurboQuant files to the instance and restarts vLLM with the TQ wrapper. The same 72 KB tarball, the same launcher script, the same result: 3.8x more KV cache blocks.
The cloud reasoning endpoint is bridged to the local system via an SSH tunnel (localhost:8176 -> Vast.ai:8000), managed by ReasoningNodeManager which auto-discovers running instances, establishes tunnels, monitors health, and restarts on failure. Docker containers reach it via host.docker.internal:8176.
The Open-Source Package: aither-kvcache
Everything described in this post is available as a pip-installable Python package:
```shell
pip install aither-kvcache           # Core library (PyTorch, NumPy)
pip install aither-kvcache[triton]   # + Triton GPU kernels
pip install aither-kvcache[vllm]     # + vLLM plugin integration
pip install aither-kvcache[all]      # Everything
```
The package is aither-kvcache, licensed CC-BY-4.0. It provides:
Standalone quantizer -- works anywhere PyTorch runs:
```python
from turboquant import TurboQuant

tq = TurboQuant(head_dim=128, bits=4, device="cuda")
packed, norms = tq.encode(kv_vectors)  # [..., 128] -> [..., 64] uint8 + [...] f32
decoded = tq.decode(packed, norms)     # [..., 64] uint8 + [...] f32 -> [..., 128] f16
print(tq.validate())                   # MSE vs theoretical bounds
print(tq.memory_report(seq_len=40000, num_layers=32, num_kv_heads=8))
```
Native vLLM plugin -- zero-config integration via vLLM's entry_points system:
```toml
# In aither-kvcache's pyproject.toml:
[project.entry-points."vllm.plugins"]
aither_kvcache = "turboquant.vllm.plugin:register"
```
When aither-kvcache[vllm] is installed in the same environment as vLLM, the plugin auto-registers at startup in every process -- API server and engine workers -- with no monkey-patching, no sitecustomize hooks, and no import hacks. vLLM's plugin system loads the entrypoint, which calls register_backend(AttentionBackendEnum.CUSTOM, "turboquant.vllm.backend.TurboQuantBackend"). The engine page-size patches and KV cache management layer on top.
This is the clean path that replaces the sitecustomize approach described earlier. Install the package, set AITHER_TQ_BITS=4, start vLLM. That is it.
The internal AitherOS integration (lib/gpu/turboquant/) and the published package share the same core code. The internal version adds AitherOS-specific wiring: tiered cache integration, vLLM entrypoint scripts, and the cloud deployment pipeline. The package is the portable, dependency-light version that works with any vLLM installation.
On Blackwell (SM_120), set AITHER_TQ_FORCE_TRITON=1 and AITHER_TQ_EAGER=0 for full speed with torch.compile and CUDA graphs. For split-k tuning, AITHER_TQ_SPLITK_THRESHOLD=64 and AITHER_TQ_SPLITK_SPLITS=16 are the benchmarked defaults for RTX 5090 (170 SMs).
What We Actually Built
Starting from a paper (arXiv:2504.19874) and ending with 589 tok/s aggregate (20 concurrent) fused decode on Blackwell:
| Component | Status | Detail |
|---|---|---|
| TQ core library (codebook, rotation, packing, quantizer) | Production | ~2,200 lines |
| Fused Triton GPU kernels (2-bit, 4-bit quant+pack) | Production | 92+ tests passing |
| Fused TQ decode kernel (rotated-domain, split-k, Triton) | Production | cosine=1.000000 |
| TQGPUCache (5D contiguous VRAM storage) | Production | RTX 5090, 36 layers |
| DDR5 cold tier (bulk spill/warm, 1 GB/s) | Production | pinned memory, 4 ops |
| TurboQuantImpl (fused decode + Triton prefill) | Production | vLLM backend, CUDA graph safe |
| TieredKVCache (async cold tier) | Production | 36 layers active |
| vLLM sitecustomize hooks | Production | Both processes patched |
| Attention impl wrapping (isinstance-safe) | Production | 36 impls wrapped |
| torch.compile + CUDA graph compatibility | Production | 26 -> 64 tok/s |
| Fused Triton KV update kernel | Production | Single launch encode+store |
| Hybrid TQ35/TQ25 quantizer (QJL residual) | Production | 3.5/2.5 avg bits |
| Cloud GPU deployment (Vast.ai auto-provision) | Production | Live on RTX 3090 |
| SSH tunnel automation (ReasoningNodeManager) | Production | Auto-discovery + health |
| aither-kvcache PyPI package | v2.0.0 | pip install aither-kvcache |
| PRIMARY fused decode (no decompress buffer) | Production | Reuses SHADOW kernel on primary cache |
| Graphable fused decode (CUDA graph capture) | Production | 87.9 -> 224.1 tok/s at 5 concurrent |
| Performance optimizations (rotation cache, buffer reuse, split-k 64/16) | Production | 288.6 tok/s peak at 10 concurrent |
| vLLM native plugin (entry_points registration) | Production | Zero-config |
| torch.library.custom_op registration | Production | Zero graph breaks on decode hot path |
| KVCacheGraph (block relationship graph) | Production | 5 edge types, graph-aware eviction |
| GraphEvictionAdvisor (background pre-compute) | Production | Zero lock contention on decode path |
| tq-t4nc deployment (vLLM upstream TQ backend) | Production | 589 tok/s at 20 concurrent |
174 unit tests + 38 integration tests passing. The orchestrator serves inference at 309K token capacity with the 3-tier cache active. The cloud reasoning model serves at 126K tokens with TQ4 page compression. Fused decode runs at 589 tok/s aggregate (20 concurrent) on Blackwell -- validated at 84.5% GSM8K 5-shot accuracy (200 samples). Hybrid TQ35 mode fits 6% more blocks in the same VRAM budget with better reconstruction quality.
What We Learned
The algorithm is sound. Our measured MSE ratios (1.86x to 2.39x of the information-theoretic lower bound) validate the paper's claims with margin to spare. The data-oblivious property is real -- we never ran a calibration step, and the distortion matches theory on random vectors.
The implementation is portable. The pure PyTorch path runs on CPU without modification. The Triton kernels require CUDA. Everything passes on both paths. The codebook is 16 floating-point numbers. The rotation matrix is a one-time 128x128 matrix multiply. The packing is bit shifts. There is nothing expensive here.
The vLLM integration is where the complexity lives. The core algorithm was a weekend project. Making vLLM respect a non-standard KV cache format is a systems engineering problem that touches process boundaries, memory allocation, tensor layout conventions, and CUDA kernel interfaces. The distance between "my quantizer works" and "my quantizer serves production traffic" is significant.
Monkey-patching multi-process Python is fragile but works. The sitecustomize.py approach -- intercepting module imports before they happen, across process boundaries -- is not elegant. But it is the correct strategy for vLLM v0.15's architecture, where both the API server and engine workers independently import the KV cache specification.
CUDA graphs require architectural discipline. You cannot have dynamic control flow inside a function that CUDA graphs will capture. The initial fix (is_current_stream_capturing() bypass) doubled throughput to 64 tok/s. But the real unlock came from making the fused Triton kernel itself graphable -- extracting the hot path into a standalone method with no Python control flow, no lazy init, no try/except. That let CUDA graphs capture the TQ kernel directly instead of falling back to standard attention. Combined with split-k tuning for Blackwell's SM count, it went from 64 to 224 tok/s. The lesson: design for graph capture from the start, and make the custom kernel graphable, not just the fallback.
Test on the target hardware. Three of our five production bugs -- shadow cache gating, env passthrough, OOM on allocation -- only manifested on the RTX 5090 in Docker. Unit tests on CPU caught the math. Integration tests on the actual GPU caught the systems engineering.
TQ is now both a capacity win AND a throughput win. Earlier versions traded throughput for capacity -- TQ PRIMARY ran at 51 tok/s (5 concurrent) vs fp8's 120-150 tok/s. The progression tells the story: v0.8.0 graphable fused decode hit 224 tok/s, v0.9.1 eliminated graph breaks for 40 tok/s single-request, v1.0 registered torch.library.custom_op for zero graph breaks on the decode hot path, and v1.3 with tq-t4nc hit 589 tok/s at 20 concurrent -- surpassing fp8 throughput while maintaining 3.8x memory compression. You get 309K tokens vs ~80K AND faster decode. The turning point was making every TQ operation graphable and tuning split-k parallelism to saturate the GPU.
Profile the right bottleneck. The initial fused decode showed no improvement over batch-decompress at short context because both paths spend <1ms on attention -- the MLP layers dominate. The decompress bottleneck only manifests at 128+ active blocks (2K+ context). The fused path's real win is eliminating the 1.2 GB decompression buffer allocation and scaling linearly with the Triton kernel instead of quadratically with Python-side gather+decode+scatter.
The core library is published as aither-kvcache on PyPI -- pip install aither-kvcache[vllm] and you are done. Codebook, rotation, packing, quantizer, fused attention, Triton kernels, graph-aware eviction. 174+ tests passing. No external dependencies beyond PyTorch. CC-BY-4.0. The paper is Zandieh et al., arXiv:2504.19874.
309K tokens hot on the local orchestrator. 3.9 million tokens addressable from DDR5. 126K tokens on a $0.12/hr cloud reasoning GPU. 589 tokens per second aggregate decode on Blackwell at 20 concurrent. 84.5% GSM8K accuracy. On a single consumer GPU.
Update: v0.8.1 — TQ35-PRIMARY and the CUDA Graph Trap
March 31, 2026
Two developments since v0.8.0 shipped.
TQ35-PRIMARY mode: 329K tokens. The hybrid quantizer (TQ35) was SHADOW-only — it encoded a shadow copy while vLLM's fp8 cache handled attention. That left capacity on the table. v0.8.1 adds tq35-primary: the page_size and reshape patches now handle hybrid layouts (norms embedded in packed data, no separate +4 bytes), so vLLM allocates blocks at the hybrid compressed size. The result is 329,072 tokens on the same 0.55 GPU utilization that gave 282K with TQ4 and 150K with standard fp8. A 4x compression ratio over standard page sizes. The entrypoint accepts AITHER_TQ_MODE=tq35-primary alongside the existing tq4-primary.
The catch: hybrid mode cannot use the fused TQPagedAttention kernel (which requires separate float32 norms). TQ35-PRIMARY falls back to decompress-then-attend, which is correct but slower (~2-5 tok/s single). For throughput-sensitive workloads, TQ4-PRIMARY with the fused kernel remains the better choice. TQ35-PRIMARY is for when you need maximum context window and can tolerate lower throughput — long document ingestion, batch summarization, retrieval-heavy agents.
The CUDA graph diagnosis. After a speculative decoding experiment was added and reverted, TQ4-PRIMARY started producing <think>fff... gibberish. The code was identical to the working baseline. The torch.compile cache was cleared. Yet gibberish persisted. The root cause turned out to be the CUDA graph capture itself — not stale caches, not code changes. Running with AITHER_TQ_EAGER=1 (which sets --enforce-eager, disabling torch.compile and CUDA graphs entirely) immediately produced perfect output: "The capital of France is Paris." at 309K tokens.
The lesson: when CUDA graphs replay compiled graphs that include custom attention backends, the graph capture can bake in tensor states or control flow assumptions that break when the model's attention pattern changes between capture and replay. The TQ fused kernel is pure tensor ops and should be graphable — but something in the capture-replay cycle is corrupting the output. The enforce-eager workaround gives ~6-8 tok/s single request, which is usable while we investigate the graph capture issue. The previous 224 tok/s numbers were with working CUDA graphs, and we expect to recover that throughput once the capture bug is identified.
| Milestone | Status | Key metric |
|---|---|---|
| TQ35-PRIMARY mode | Production | 329K tokens (4x compression) |
| CUDA graph diagnosis | In progress | Eager fallback: 6-8 tok/s |
| Eager-mode TQ4-PRIMARY | Production | 309K tokens, coherent output |
Current recommended configuration: AITHER_TQ_MODE=tq4-primary with AITHER_TQ_EAGER=1. 309K tokens, coherent output, no gibberish. CUDA graph throughput recovery is the next target.
329K tokens hot on a single consumer GPU. The ceiling keeps moving.
Update: v0.9.1 — The Graph Break Massacre (11 → 40 tok/s)
April 2, 2026
The v0.8.1 eager fallback gave us coherent output at 6-8 tok/s single request, climbing to 11 tok/s after the three SDPA bugs were fixed (wrong causal mask, missing GQA expansion, degenerate output from is_causal=True with q_len < kv_len). But 11 tok/s on hardware that does 185 tok/s natively was still unacceptable. Time to profile.
The custom backend was never the right path. vLLM v0.15's custom attention backend registration triggers an Inductor tensor corruption bug at splitting op boundaries. Our previous approach — registering TurboQuantBackend as CUSTOM — required enforce-eager to avoid gibberish. The alternative: don't register a backend at all. Monkey-patch TritonAttentionImpl.forward() to hook TQ encode/decode into the standard Triton backend. vLLM keeps its standard attention code path, torch.compile works normally, and TQ operations run as surgical graph breaks within the compiled forward pass.
The profiling told us exactly where to look. One decode token was taking ~95ms across 36 transformer layers. Each layer's TQ forward created graph breaks — transitions from compiled code to Python and back. The question was how many.
The answer was 144 graph breaks per token. Per layer:
- _tq_init_layer() -- @torch.compiler.disable even though it's a no-op after the first call
- _ensure_quantizer() -- same: no-op after init, still creates a graph break
- _ensure_norms() -- same
- _tq_encode_phase() -- encodes the new K/V token into the TQ cache
- _tq_fused_decode() -- fused attention from the compressed cache
Three of those are doing nothing (after warmup) but still forcing Python entry/exit on every call to every layer. And the two that do real work ran as separate graph breaks, meaning 2 Python transitions for work that should be 1.
Fix 1: Guard the init functions. Replace unconditional calls with Python-level checks (if _tq_quantizer is None: ...). After warmup, these are just Python if statements — no @torch.compiler.disable, no graph break. Eliminated 108 graph breaks per token.
Fix 2: Merge encode + decode. Created _tq_decode_step() — a single @torch.compiler.disable function that calls encode and fused attention together. Instead of exiting compiled code for encode, returning, re-entering compiled code, exiting again for decode, returning — it's one exit and one return. Halved the remaining graph breaks from 72 to 36.
Fix 3: Kill the CPU-GPU syncs. The encode path used .any() to check if slot mappings were valid. In PyTorch, tensor.any() in a Python if triggers an implicit .item() — a CUDA-to-CPU synchronization that blocks the GPU pipeline. Two .any() checks per layer per token = 72 synchronization stalls per generated token, each costing ~3-5 microseconds of pipeline bubble.
For decode, this is absurd. Decode always has exactly 1 token with exactly 1 valid slot. The fix: remove all boolean masking and branching from the encode path. Direct index arithmetic: bi = slot_mapping // block_size, oi = slot_mapping % block_size. No masks, no .any(), no CPU-GPU sync.
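The branchless indexing is a few lines of tensor arithmetic (a sketch with toy shapes and scalar values; the real path scatters packed uint8 vectors into the paged cache):

```python
import torch

block_size = 16
num_blocks = 4
cache = torch.zeros(num_blocks, block_size)

# One slot per decode token; no validity mask, no .any(), no CPU-GPU sync
slot_mapping = torch.tensor([5, 17, 40])   # flat slot ids from vLLM
values = torch.tensor([1.0, 2.0, 3.0])

bi = slot_mapping // block_size            # block index
oi = slot_mapping % block_size             # offset within block
cache[bi, oi] = values                     # direct scatter, stays on-device

assert cache[0, 5] == 1.0 and cache[1, 1] == 2.0 and cache[2, 8] == 3.0
```

Because no tensor value is ever inspected from Python, the whole write enqueues asynchronously and the GPU pipeline never stalls.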
The numbers:
| Metric | Before | After |
|---|---|---|
| Graph breaks per token | 144 | 36 |
| CPU-GPU syncs per token | 72 | 0 |
| L0 merged decode time | 29.34 ms | 1.91 ms |
| Single-request tok/s | 11.0 | 40.1 |
| Wall time for 256 tokens | 24.3s | 6.4s |
3.6x speedup from eliminating overhead, not from faster GPU kernels. The GPU was always fast enough — we were drowning it in Python round-trips.
The remaining 36 graph breaks (one per layer, for the merged encode+fused decode) are load-bearing — the TQ encode uses operations that torch.compile can't trace (dynamic codebook lookups, Triton kernel launches). Eliminating those requires registering TQ operations as torch.library.custom_op, which would make the entire decode path graphable. That is the path to recovering the full 224 tok/s concurrent throughput from v0.8.0 — but 40 tok/s single-request is already serviceable for interactive use.
| Milestone | Status | Key metric |
|---|---|---|
| Hook-based integration (no custom backend) | Production | torch.compile compatible |
| Merged decode step | Production | 144 → 36 graph breaks |
| Branchless encode | Production | 72 → 0 CPU-GPU syncs |
| Single-request throughput | Production | 40 tok/s (was 11) |
| aither-kvcache v0.9.1 | Released | pip install aither-kvcache |
Current recommended: AITHER_TQ_MODE=tq4-primary. No enforce-eager needed -- the hook path doesn't register a custom backend, so torch.compile works normally. 309K tokens, coherent output, 40 tok/s.
The lesson: profile your graph breaks, not your kernels. The GPU was waiting for Python 144 times per token.
Update: v1.0 -- Zero Graph Breaks (custom_op Registration)
April 3, 2026
v0.9.1 left 36 graph breaks per token -- one per layer, for the merged encode+fused decode step. The encode path uses dynamic codebook lookups and Triton kernel launches that torch.compile cannot trace. The solution was always clear: register TQ operations as torch.library.custom_op, making them opaque to the compiler but callable inside compiled graphs.
Two custom ops:
```python
@torch.library.custom_op("tq::decode_step", mutates_args=())
def tq_decode_step(query, k_packed, k_norms, v_packed, v_norms,
                   rotation, centroids, scale, num_blocks, block_size):
    # Fused encode + rotated-domain attention
    ...

@torch.library.custom_op("tq::hybrid_decode_step", mutates_args=())
def tq_hybrid_decode_step(query, k_packed, k_norms, v_packed, v_norms,
                          rotation, centroids_hi, centroids_lo, scale,
                          num_blocks, block_size, split_dim):
    # Same, but for TQ35/TQ25 hybrid layouts
    ...
```
torch.compile treats these as black boxes -- it does not try to trace inside them, does not generate graph breaks around them, and CUDA graphs capture them as opaque kernel launches. The compiler sees a pure function with known input/output shapes and types, which is all it needs to schedule them within the compiled graph.
The result: zero graph breaks on the decode hot path. The entire forward pass -- embedding, RMSNorm, QKV projection, TQ encode, fused attention, output projection, MLP, all 36 layers -- compiles into a single CUDA graph. No Python round-trips during decode.
Both uniform modes (tq4, tq3, tq2) and hybrid modes (tq35, tq25) are graphable. The hybrid op takes separate high/low centroid tables and a split dimension, matching the QJL residual architecture from TQ35.
Alongside the custom ops, we built KVCacheGraph -- a 520-line faculty graph that models KV cache blocks as graph nodes.
KVCacheGraph: Blocks as a Relationship Graph
Standard KV cache eviction is FIFO or LRU -- evict the oldest or least-recently-used block. This ignores structure. A system prompt block shared by 50 concurrent requests is more valuable than a single-use generation block, regardless of access time. A block that is always co-attended with the current working set should be prefetched, not evicted.
KVCacheGraph models five types of relationships between KV blocks:
| Edge type | Meaning | Source |
|---|---|---|
| prefix_share | Block reused across requests (prefix caching) | vLLM block allocator |
| co_attend | Blocks frequently attended together | Attention metadata |
| semantic | Similar key vector embeddings | Embedding distance |
| temporal | Sequential in the same generation | Token position |
| spill_link | Hot VRAM block linked to cold DDR5 copy | Tier cache bridge |
Each block is a KVBlockNode with an importance score, source label (system, user, assistant, tool, generation), token range, and access count. The graph supports:
- Graph-aware eviction: suggest_eviction(n_blocks) returns the least-connected, lowest-importance subgraph. System prompt blocks and blocks with high prefix_share degree are protected.
- Graph-aware prefetch: suggest_prefetch(active_blocks) returns cold-tier blocks that are graph-neighbors of the current working set -- blocks likely to be needed next.
- Cross-session KV reuse: Blocks tagged with prefix_share edges can be identified as reusable across requests without hashing the actual KV data.
The graph is standalone -- no AitherOS dependencies. It ships as part of the aither-kvcache package.
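A toy version of the eviction policy conveys the idea (an illustration, not the package's implementation; BlockNode, the degree threshold, and the scoring are all hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class BlockNode:
    block_id: int
    importance: float
    source: str                                    # system, user, tool, generation
    neighbors: set = field(default_factory=set)    # prefix_share / co_attend edges

def suggest_eviction(nodes, n):
    # Protect system-prompt blocks and highly shared blocks; among the rest,
    # evict the least-connected, lowest-importance blocks first.
    candidates = [b for b in nodes
                  if b.source != "system" and len(b.neighbors) < 3]
    candidates.sort(key=lambda b: (len(b.neighbors), b.importance))
    return [b.block_id for b in candidates[:n]]

blocks = [
    BlockNode(0, 9.0, "system", {1, 2, 3, 4}),   # shared prefix: protected
    BlockNode(1, 0.2, "generation"),
    BlockNode(2, 0.8, "tool", {1}),
    BlockNode(3, 0.1, "generation"),
]
print(suggest_eviction(blocks, 2))  # [3, 1]: loneliest, least important first
```

Contrast with LRU, which would happily evict the shared system-prompt block if it happened to be the oldest.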
| Milestone | Status | Key metric |
|---|---|---|
tq::decode_step custom_op | Production | Zero graph breaks (uniform modes) |
tq::hybrid_decode_step custom_op | Production | Zero graph breaks (hybrid modes) |
| KVCacheGraph (5 edge types) | Production | 520 lines, graph-aware eviction |
| Full CUDA graph capture (all TQ modes) | Production | 7/7 graphs captured |
| RTX 5090 validation | Production | 23.6 tok/s single, 120 tok/s 5x concurrent |
aither-kvcache v1.0 | Released | 174 unit tests + 38 integration tests |
Update: v1.3 -- Graph Eviction in Production (589 tok/s)
April 5, 2026
Two things shipped.
GraphEvictionAdvisor. The KVCacheGraph's suggest_eviction() is O(n log n) in the number of blocks -- fine for batch decisions, too slow for the decode hot path where eviction decisions happen per-token. The advisor runs as a background thread that pre-computes eviction and prefetch rankings every 500ms, storing them as an atomic reference. The decode path reads the pre-computed list with zero lock contention:
```python
advisor = GraphEvictionAdvisor(graph, interval=0.5, eviction_batch=256)
advisor.start()

# Hot decode path -- zero blocking:
candidates = advisor.get_eviction_candidates(n=16)         # atomic read
prefetch = advisor.get_prefetch_candidates(active, n=8)    # atomic read
```
If the ranking is stale (> 2 seconds since last recompute), the advisor returns None and the caller falls back to FIFO. In practice, the background thread finishes in < 50ms per cycle, so staleness never triggers during normal operation.
The advisor auto-loads as a vLLM general plugin via vllm.general_plugins entry point -- install aither-kvcache[vllm] and the graph eviction advisor starts automatically alongside the attention backend. No configuration needed.
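The zero-contention read works because rebinding a Python attribute is atomic under the GIL: the background thread builds a fresh list and swaps the reference, and readers never see a half-built ranking. A minimal sketch of the pattern (toy ranking function, hypothetical names):

```python
import threading
import time

class BackgroundAdvisor:
    def __init__(self, rank_fn, interval=0.5):
        self._rank_fn = rank_fn
        self._interval = interval
        self._ranking = None           # atomic reference: rebound, never mutated
        self._stamp = 0.0
        self._stop = threading.Event()

    def start(self):
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while not self._stop.is_set():
            self._ranking = self._rank_fn()   # build a fresh list, then swap
            self._stamp = time.monotonic()
            self._stop.wait(self._interval)

    def get_candidates(self, n, max_age=2.0):
        ranking = self._ranking               # single atomic read, no lock
        if ranking is None or time.monotonic() - self._stamp > max_age:
            return None                       # stale: caller falls back to FIFO
        return ranking[:n]

advisor = BackgroundAdvisor(lambda: list(range(100)), interval=0.05)
advisor.start()
time.sleep(0.2)
print(advisor.get_candidates(4))  # [0, 1, 2, 3]
```

The staleness check is the safety valve: if the background thread ever falls behind, callers degrade to the simple policy rather than acting on outdated rankings.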
tq-t4nc deployment. We switched from our monkey-patched TQ integration to a cleaner path: vLLM's upstream TurboQuant backend support (originally from PR #38479 in the vLLM repo). The tq-t4nc (TurboQuant for T4/NC -- "native compilation") image installs aither-kvcache==1.3.1 and uses the graph eviction plugin as a general plugin entry point rather than attention backend hooks. The result is a simpler, more maintainable integration that plays nicely with vLLM's internal torch.compile pipeline.
Benchmark: Nemotron-8B-AWQ on RTX 5090
We ran a comprehensive benchmark suite covering throughput, concurrency scaling, and quality:
Throughput (tq4-primary, gpu_memory_utilization=0.42):
| Scenario | Result |
|---|---|
| Single request, cold, 32 tok | 34.1 tok/s |
| Single request, cold, 256 tok | 33.2 tok/s |
| Single request, cold, 1024 tok | 31.7 tok/s |
| Single request, cold, 4096 tok | 38.1 tok/s |
| 5x concurrent, cold, 256 tok | 174.8 tok/s aggregate |
| Median throughput | 37.8 tok/s |
| 20x concurrent | 589 tok/s aggregate |
Quality (GSM8K 5-shot, 200 samples):
| Metric | Value |
|---|---|
| GSM8K accuracy | 84.5% |
| Extraction method | Think-tag aware (extracts from `<think>` blocks) |
| Tool calling validation | 10/10 passed |
84.5% on GSM8K with a 4-bit quantized 8B model running through sub-byte KV cache compression. The quantization noise from TQ4 does not measurably degrade reasoning quality -- the model produces identical chain-of-thought traces whether the KV cache is FP8 or TQ4.
The scaling curve is revealing. Single-request throughput is 32-38 tok/s, bounded by the model's MLP layers (the same bottleneck with or without TQ). At 5x concurrency the GPU starts to saturate on attention -- 175 tok/s aggregate. At 20x we hit 589 tok/s because the GPU is fully batching across all concurrent requests, and the TQ fused kernel's per-token work (codebook lookup + dot product) is trivial compared to the attention computation itself. Above 20x, throughput drops because the KV cache blocks start contending for VRAM bandwidth.
The progression from paper to production:
| Version | Date | Milestone | Key metric |
|---|---|---|---|
| v0.1 | Mar 20 | Core library + Triton kernels | MSE within 2.7x bound |
| v0.5 | Mar 25 | TieredKVCache + vLLM integration | 280K tokens, 26 tok/s |
| v0.7 | Mar 27 | Graphable fused decode | 87.9 tok/s (5 concurrent) |
| v0.8 | Mar 28 | Split-k + rotation cache + buffer reuse | 224 tok/s (5 concurrent) |
| v0.8.1 | Mar 31 | TQ35-PRIMARY hybrid mode | 329K tokens |
| v0.9.1 | Apr 2 | Graph break massacre | 40 tok/s single (was 11) |
| v1.0 | Apr 3 | torch.library.custom_op | Zero graph breaks |
| v1.3 | Apr 5 | Graph eviction + tq-t4nc | 589 tok/s (20 concurrent) |
Current recommended configuration:
```shell
AITHER_TQ_MODE=tq4-primary
AITHER_TQ_FORCE_TRITON=1
AITHER_TQ_EAGER=0
```
Install aither-kvcache[vllm]>=1.3.1 in your vLLM environment. The graph eviction advisor loads automatically. Start vLLM with --attention-backend TRITON_ATTN. 309K tokens, 589 tok/s at 20 concurrent, 84.5% GSM8K accuracy. On a single consumer GPU that costs less than a cloud API call.
309K tokens hot. 3.9 million tokens addressable from DDR5. 589 tokens per second. The ceiling keeps moving.
Update: v2.0.0 -- TriAttention: Spectral KV Compression (9.85x in 26 Bytes)
April 6, 2026
TurboQuant compresses KV vectors with vector quantization -- codebooks, rotations, bit-packing. It is data-oblivious by design: the codebook is fixed, the rotation is fixed, the compression ratio is fixed. That is its strength (no calibration) and its limitation (the compression ratio ceiling is set by the codebook size, not by the data).
TriAttention takes a fundamentally different approach: exploit the mathematical structure of RoPE attention itself.
The Core Insight
Rotary Position Embedding (RoPE) encodes position by rotating key/query pairs through frequency-dependent angles. The attention score between query at position m and key at position n decomposes as:
score(m, n) = sum_{f=1}^{d/2} E_f * cos(theta_f * (m - n) + phi_f)
where theta_f = base^(-2f/d) are the RoPE frequencies, E_f is the energy (amplitude) of pair f, and phi_f is its phase. This is a trigonometric series in the position difference (m - n).
Most of the energy concentrates in a small number of frequency pairs. For a typical Qwen3.5 layer with head_dim=128 (64 frequency pairs), the top 12 pairs carry >95% of the spectral energy. The other 52 pairs contribute noise-level corrections.
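To make "pair energy" concrete, here is a small NumPy sketch that ranks frequency pairs for a single key vector. It assumes the interleaved (even, odd) coordinate pairing with f starting at 0; real RoPE implementations differ in these conventions, so treat it as illustrative:

```python
import numpy as np

def rope_pair_energies(k, base=10000.0):
    """Frequencies theta_f = base^(-2f/d) and per-pair energy of a key vector."""
    d = k.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)  # RoPE frequency table
    pairs = k.reshape(-1, 2)                        # (d/2, 2) coordinate pairs
    energy = np.linalg.norm(pairs, axis=-1)         # E_f per frequency pair
    return theta, energy

rng = np.random.default_rng(0)
k = rng.standard_normal(128).astype(np.float32)     # head_dim=128 -> 64 pairs
theta, E = rope_pair_energies(k)
order = np.argsort(E)[::-1]                         # pairs ranked by energy
frac = (E[order[:12]] ** 2).sum() / (E ** 2).sum()  # energy kept by top 12
```

For a random Gaussian vector the top 12 pairs hold only a modest share of the total energy; the >95% figure reflects the concentration observed in real model keys.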
What TriAttention Stores
Instead of storing the full 128-dimensional key vector (256 bytes in FP16), TriAttention stores only the top-F frequency pairs:
| Component | Per pair | Total (F = 12) |
|---|---|---|
| Frequency index | -- | 12 x 1 byte = 12 B |
| Cosine coefficient | int4 | 12 x 0.5 byte = 6 B |
| Sine coefficient | int4 | 12 x 0.5 byte = 6 B |
| Scale + norm | float16 | 2 B |
| Total per token | -- | 26 B |
Compare: FP16 stores 128 dimensions x 2 bytes = 256 bytes per token. TriAttention at F=12, int4 coefficients: 26 bytes per token. That is a 9.85x compression ratio -- more than double TQ4's 3.8x.
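The byte accounting behind these ratios is simple to verify. A sketch of the arithmetic, using the component layout from the table above (one index byte and two coefficients per pair, plus the 2-byte scale/norm field):

```python
def spectral_bytes_per_token(F, coeff_bits, head_dim=128):
    """Bytes per token for the layout above, plus compression vs. FP16."""
    index = F * 1                            # 1-byte frequency index per pair
    coeffs = 2 * F * coeff_bits / 8          # cosine + sine coefficients
    overhead = 2                             # scale + norm field
    total = index + coeffs + overhead
    return total, (head_dim * 2) / total     # (bytes, ratio vs. 256-byte FP16)

bytes_tq, ratio = spectral_bytes_per_token(12, 4)   # 26.0 bytes, ~9.85x
```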
The Compression Ladder
| Method | Bytes/token | Compression | Quality | Calibration |
|---|---|---|---|---|
| FP16 (baseline) | 256 | 1.0x | Exact | None |
| FP8 (vLLM default) | 128 | 2.0x | Near-exact | None |
| TQ4 (TurboQuant) | 68 | 3.8x | Good | None |
| TQ35 (hybrid) | 64 | 4.0x | Better | None |
| TQ25 (aggressive) | 44 | 5.8x | Acceptable | None |
| TriAttention (F=16, int8) | 50 | 5.1x | High | Per-model |
| TriAttention (F=12, int4) | 26 | 9.85x | Good | Per-model |
| TriAttention (F=8, int4) | 18 | 14.2x | Acceptable | Per-model |
TriAttention trades TurboQuant's data-oblivious simplicity for higher compression. The tradeoff: TriAttention requires per-model calibration profiles (which frequency pairs matter, how many to keep). TurboQuant works on any model with no calibration. Both ship in the same package.
Qwen3.5 Calibration Profiles
TriAttention ships with pre-computed spectral profiles for the Qwen3.5 model family -- the same models that power our orchestrator and reasoning pipeline:
| Model | Layers | KV Heads | Default F | Compression |
|---|---|---|---|---|
| Qwen3.5-0.6B | 28 | 2 | 10 | 11.6x |
| Qwen3.5-1.7B | 28 | 4 | 12 | 9.85x |
| Qwen3.5-4B | 36 | 8 | 12 | 9.85x |
| Qwen3.5-8B | 36 | 8 | 14 | 8.5x |
| Qwen3.5-14B | 40 | 8 | 14 | 8.5x |
| Qwen3.5-32B | 64 | 8 | 16 | 7.5x |
| Qwen3.5-30B-A3B | 48 | 4 | 12 | 9.85x |
Each profile includes per-layer frequency schedules -- early layers (which capture broad semantic patterns) keep fewer frequencies, while deep layers (which refine token-level predictions) keep more. The profiles are generated by calibration.py and require no user data -- they are computed from the model's RoPE configuration and architectural constants.
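The shipped schedules are produced by calibration.py from each model's RoPE configuration and architectural constants; as an illustration of the shape described above (not the actual numbers), a schedule could ramp F from early to deep layers like this:

```python
import numpy as np

def layer_frequency_schedule(num_layers, f_early=10, f_late=16):
    """Illustrative per-layer schedule: early layers keep fewer frequency
    pairs, deep layers keep more. A linear ramp is only one plausible
    shape -- the real profiles are model-specific."""
    return np.linspace(f_early, f_late, num_layers).round().astype(int)

schedule = layer_frequency_schedule(36)   # 36 layers, e.g. Qwen3.5-4B
```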
Scoring: Reconstructing Attention from Coefficients
At decode time, TriAttention reconstructs approximate attention scores directly from the stored trig coefficients -- no decompression to full vectors needed. The scorer evaluates:
s_hat(m, n) = sum_{f in top-F} a_hat_f * cos(theta_f * delta) + b_hat_f * sin(theta_f * delta)
where delta = m - n is the position difference, and a_hat_f, b_hat_f are the dequantized coefficients. This is O(F) per query-key pair instead of O(d) -- a 5-10x reduction in FLOPs for the score computation itself.
The scorer supports two modes:
- Pre-RoPE (default): Computes scores analytically from trig coefficients with RoPE phases folded in. No vector reconstruction.
- Post-RoPE: Reconstructs approximate key vectors and uses standard dot-product attention. Higher quality, higher cost. Useful as a reference for validation.
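A minimal NumPy sketch of the pre-RoPE mode, evaluating the series above directly from stored coefficients (shapes and names are illustrative; the packaged scorer is batched over heads):

```python
import numpy as np

def trig_series_scores(freq_idx, a_hat, b_hat, theta, q_pos, k_pos):
    """Approximate scores for one query against S cached keys.

    freq_idx : (S, F) int   kept frequency indices per cached token
    a_hat    : (S, F) float dequantized cosine coefficients
    b_hat    : (S, F) float dequantized sine coefficients
    theta    : (d/2,) float full RoPE frequency table
    q_pos    : int          query position m
    k_pos    : (S,) int     key positions n
    O(F) work per query-key pair, independent of head_dim.
    """
    delta = q_pos - k_pos                    # (S,) position differences
    ang = theta[freq_idx] * delta[:, None]   # (S, F) theta_f * delta
    return (a_hat * np.cos(ang) + b_hat * np.sin(ang)).sum(axis=-1)
```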
The Architecture
```
New token arrives
  -> SpectralKVEncoder.encode():
       1. Compute RoPE frequency energies for this token's K vector
       2. Select top-F pairs by energy
       3. Compute trig coefficients (cos/sin projections)
       4. Quantize coefficients to int4/int8/int16
       5. Store SpectralEncoding in SpectralKVCache (paged blocks)

Decode attention
  -> TrigSeriesScorer.score():
       1. Load stored coefficients for cached tokens
       2. Evaluate trig series at position difference delta = q_pos - k_pos
       3. Return approximate attention scores [B, H, 1, S]
  -> Accumulate values via scatter/gather
  -> Return output
```
The entire pipeline -- encode, cache, score, accumulate -- is pure PyTorch. No Triton kernels yet (that is the next step). The reference implementation validates correctness; the fused GPU path will follow the same progression as TurboQuant: CPU reference -> Triton encode -> fused decode kernel -> CUDA graph capture.
Using TriAttention
```python
from aither_kvcache.triattention import TriAttention, TriAttentionConfig

# Default config: F=12, int4, head_dim=128 (Qwen3.5)
config = TriAttentionConfig()
tri = TriAttention(config, num_layers=36, max_seq_len=32768, device="cuda")

# Encode a KV pair
tri.encode_kv(layer_idx=0, key=k, value=v, position=pos)

# Decode attention
output = tri.forward(layer_idx=0, query=q, query_pos=pos)

# Or use a pre-calibrated Qwen3.5 profile:
from aither_kvcache.triattention.calibration import get_config_for_model
config = get_config_for_model("Qwen3.5-8B")
```
What Ships in v2.0.0
The aither-kvcache package is now at v2.0.0 on PyPI:
```shell
pip install "aither-kvcache>=2.0.0"
```
(Quote the requirement specifier -- an unquoted `>` is a shell redirect.)
The package now contains two complementary compression engines:
| Engine | Approach | Compression | Calibration | GPU kernels |
|---|---|---|---|---|
| TurboQuant | Vector quantization (codebook + rotation) | 3.8-5.8x | None (data-oblivious) | Fused Triton (production) |
| TriAttention | Spectral decomposition (trig series) | 7.5-14.2x | Per-model profiles | Reference PyTorch (v0.1) |
Both engines share the same paged block cache infrastructure and vLLM plugin system. TurboQuant remains the recommended engine for production decode throughput (589 tok/s at 20 concurrent). TriAttention is the capacity play -- when you need maximum tokens per byte and can tolerate the reference scorer's throughput while the fused kernels are built.
The full module inventory for v2.0.0:
- aither_kvcache.triattention.config -- TriAttentionConfig with compression ratio calculations
- aither_kvcache.triattention.spectral -- RoPE frequency analysis, pair energies, top-k selection, DCT
- aither_kvcache.triattention.encoder -- SpectralKVEncoder (4/8/16-bit coefficient quantization)
- aither_kvcache.triattention.scorer -- TrigSeriesScorer (pre-RoPE analytic, post-RoPE reference)
- aither_kvcache.triattention.cache -- SpectralKVCache (paged block storage, vLLM-compatible)
- aither_kvcache.triattention.attention -- TriAttention (full encode -> score -> accumulate pipeline)
- aither_kvcache.triattention.calibration -- Qwen3.5 family profiles (0.6B through 32B)
184 tests passing (152 TurboQuant + 32 TriAttention). The TriAttention monorepo test suite runs 69 tests covering spectral decomposition, encode/decode roundtrips, scorer accuracy, cache operations, calibration profiles, and full forward pass validation.
The Roadmap
TriAttention v0.1 is the mathematical foundation. The path forward mirrors TurboQuant's progression:
| Milestone | Target | What it unlocks |
|---|---|---|
| Fused Triton scorer | v2.1 | GPU-native trig series evaluation, 10x+ scorer speedup |
| Spectral cache integration with TieredKVCache | v2.2 | 3-tier spectral: VRAM + DDR5 + recompute |
| Hybrid TQ + TriAttention | v2.3 | TQ for recent tokens (fast decode), spectral for cold tokens (max compression) |
| CUDA graph capture for spectral decode | v2.4 | Full compiled pipeline, concurrent throughput |
The hybrid approach is the endgame: TurboQuant handles the hot working set (recent tokens, fast fused decode at 589 tok/s), while TriAttention compresses the cold tail (older context, 9.85x compression). The working set is small and speed-critical. The tail is massive and compression-critical. Two engines, one cache, each doing what it is best at.
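Expressed as code, that hybrid split reduces to a per-block age check. The names and the 4096-token window below are hypothetical placeholders for what v2.3 might expose:

```python
def choose_engine(token_age, hot_window=4096):
    """Toy hybrid policy: recent tokens stay on the fast TurboQuant decode
    path; older tokens migrate to TriAttention's spectral compression.
    hot_window is a hypothetical tuning knob, not a shipped default."""
    return "turboquant" if token_age < hot_window else "triattention"
```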
The Progression
| Version | Date | Milestone | Key metric |
|---|---|---|---|
| v0.1 | Mar 20 | Core library + Triton kernels | MSE within 2.7x bound |
| v0.5 | Mar 25 | TieredKVCache + vLLM integration | 280K tokens, 26 tok/s |
| v0.7 | Mar 27 | Graphable fused decode | 87.9 tok/s (5 concurrent) |
| v0.8 | Mar 28 | Split-k + rotation cache + buffer reuse | 224 tok/s (5 concurrent) |
| v0.8.1 | Mar 31 | TQ35-PRIMARY hybrid mode | 329K tokens |
| v0.9.1 | Apr 2 | Graph break massacre | 40 tok/s single (was 11) |
| v1.0 | Apr 3 | torch.library.custom_op | Zero graph breaks |
| v1.3 | Apr 5 | Graph eviction + tq-t4nc | 589 tok/s (20 concurrent) |
| v2.0 | Apr 6 | TriAttention spectral compression | 9.85x compression, 184 tests |
aither-kvcache v2.0.0 is live on PyPI. pip install aither-kvcache and you get both engines. 184 tests passing. CC-BY-4.0.
309K tokens hot on TurboQuant. 9.85x compression on TriAttention. The ceiling keeps moving.