TurboQuant: Building a Sub-Byte KV Cache Quantizer from Paper to Production
There is a constraint that dominates everything else when you serve LLMs on consumer hardware: KV cache memory. Not weights. Not activations. The KV cache. For a 32-layer, 8-head model at FP16, every token you process costs 128 KB of VRAM that stays allocated for the entire generation. At 32 GB total, with half reserved for model weights and overhead, you get maybe 80K tokens of context. That is the ceiling, and we hit it months ago.
FP8 quantization pushed it to 118K. Better, but still not enough when your agents run multi-turn conversations with full system context, tool histories, and cross-session memory retrieval. We needed more. Substantially more. And then a paper appeared on arXiv that claimed exactly what we were looking for.
The Paper
"TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate" by Zandieh et al. (arXiv:2504.19874, April 2025) describes an algorithm with a set of properties that, on first reading, sound too clean to be real:
- Data-oblivious: no calibration data needed. No representative dataset. No profiling runs.
- Online: works on streaming tokens. No batch preprocessing.
- Near-optimal: mean squared error within 2.7x of the information-theoretic lower bound.
- Sub-byte: operates at 2, 3, or 4 bits per value.
The algorithm itself is elegant. Given a vector from the KV cache:
1. Compute the L2 norm and normalize to the unit sphere
2. Apply a random orthogonal rotation (precomputed once, shared across all tokens)
3. Scalar-quantize each coordinate using a precomputed Lloyd-Max codebook
4. Pack the resulting indices into a compact bit representation
5. Store the packed indices plus the scalar norm
Dequantization reverses steps 4 through 1. The key insight is step 2: after normalization and a random rotation, the vector is distributed like a uniform point on the unit sphere, so each coordinate follows a Beta distribution that, for dimensions 64 and above, is well-approximated by a Gaussian N(0, 1/d). And for Gaussian-distributed scalars, the optimal quantizer is known exactly -- it is the Lloyd-Max quantizer, whose centroids and boundaries have been tabulated since the 1960s.
No learning. No calibration. Just geometry.
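The five steps are small enough to prototype directly. Here is a minimal NumPy sketch of the encode-decode roundtrip (no bit packing; the 4-bit centroids are the published Lloyd-Max values for N(0,1), and the function name is ours, not the package API):

```python
import numpy as np

# 4-bit Lloyd-Max centroids for N(0,1), from Max (1960) -- a sketch, not aither-kvcache.
CENTROIDS = np.array([
    -2.733266, -2.069016, -1.618002, -1.256233,
    -0.942391, -0.656804, -0.388089, -0.128350,
     0.128350,  0.388089,  0.656804,  0.942391,
     1.256233,  1.618002,  2.069016,  2.733266,
])

def encode_decode(x: np.ndarray, seed: int = 0) -> np.ndarray:
    """Roundtrip steps 1-3 of the pipeline on one vector (packing omitted)."""
    d = x.shape[-1]
    rng = np.random.default_rng(seed)
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    Q *= np.sign(np.diag(R))                 # step 2: shared random rotation
    norm = np.linalg.norm(x)                 # step 1: norm, then normalize
    y = Q @ (x / norm)
    c = CENTROIDS / np.sqrt(d)               # codebook scaled to N(0, 1/d)
    bounds = (c[:-1] + c[1:]) / 2            # Lloyd-Max decision boundaries
    idx = np.searchsorted(bounds, y)         # step 3: scalar quantize
    # steps 4-5 store idx + norm; decode runs the pipeline in reverse
    return norm * (Q.T @ c[idx])

x = np.random.default_rng(1).standard_normal(128)
x_hat = encode_decode(x)
cos = x @ x_hat / (np.linalg.norm(x) * np.linalg.norm(x_hat))
```

At 4 bits the reconstruction keeps cosine similarity around 0.995, consistent with the MSE figures reported later.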
Within 48 hours of the paper appearing, the community had already produced a C/CUDA llama.cpp fork, a PyTorch implementation running on an RTX 3060, and a Triton kernel. We decided to build our own from scratch, validate it against the mathematics, and wire it into our vLLM inference stack. The result is aither-kvcache -- an open-source Python package (CC-BY-4.0) that you can install with pip install aither-kvcache and plug into any vLLM deployment.
What We Built
The full implementation is published as aither-kvcache on PyPI. The core library is 10 Python files, approximately 2,200 lines of code. Here is how it breaks down.
The Codebook
The foundation is a set of hardcoded Lloyd-Max codebooks for the standard normal distribution at 1 through 4 bits:
_STANDARD_CODEBOOKS: Dict[int, Dict] = {
    2: {
        "centroids": np.array([-1.510469, -0.452781, 0.452781, 1.510469]),
        "mse": 0.117517,
    },
    4: {
        "centroids": np.array([
            -2.733266, -2.069016, -1.618002, -1.256233,
            -0.942391, -0.656804, -0.388089, -0.128350,
            0.128350, 0.388089, 0.656804, 0.942391,
            1.256233, 1.618002, 2.069016, 2.733266,
        ]),
        "mse": 0.009497,
    },
}
These are the optimal MSE scalar quantizers for N(0,1), from Max (1960) and Lloyd (1982). To adapt them to the actual coordinate distribution N(0, 1/d), we scale by 1/sqrt(d). We also implemented a full Lloyd-Max solver using scipy's numerical integration (compute_codebook_scipy) and validated that our hardcoded values match the solver output to within 1e-4. That solver runs the iterative conditional-mean algorithm on the exact Beta distribution for lower dimensions, and the Gaussian approximation for d >= 64.
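The scaling step and the derived decision boundaries are small enough to show inline (a sketch; the real codebook module also carries the per-width MSE constants, and `compute_codebook_scipy` is its solver, not shown here):

```python
import numpy as np

def scaled_codebook(centroids: np.ndarray, d: int):
    """Scale an N(0,1) Lloyd-Max codebook to N(0, 1/d) and derive its
    decision boundaries (midpoints between adjacent centroids)."""
    c = centroids / np.sqrt(d)
    bounds = (c[:-1] + c[1:]) / 2
    return c, bounds

# 2-bit codebook from the table above, scaled for head_dim=128
c2 = np.array([-1.510469, -0.452781, 0.452781, 1.510469])
c, bounds = scaled_codebook(c2, d=128)

# Sanity check: quantizing a centroid must return that centroid's own index
idx = np.searchsorted(bounds, c)
```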
The Rotation
The rotation module provides two methods for generating the random orthogonal matrix:
def random_orthogonal(d: int, seed: int = 42, device: str = "cuda",
                      dtype: torch.dtype = torch.float32) -> torch.Tensor:
    gen = torch.Generator(device="cpu").manual_seed(seed)
    G = torch.randn(d, d, generator=gen, dtype=dtype)
    Q, R = torch.linalg.qr(G)
    diag_sign = torch.sign(torch.diag(R))
    diag_sign[diag_sign == 0] = 1.0
    Q = Q * diag_sign.unsqueeze(0)
    return Q.to(device=device)
This is the standard Mezzadri (2007) approach: QR decomposition of a random Gaussian matrix yields a matrix uniformly distributed over the orthogonal group under Haar measure. The sign correction on the diagonal makes the factorization unique (R with a positive diagonal), which is what guarantees Haar uniformity rather than a distribution skewed by the QR routine's sign conventions.
We also provide a Randomized Hadamard Transform (RHT) option: H * D_k * H * D_{k-1} * ... * H * D_1, where H is the normalized Walsh-Hadamard matrix and each D_i is a random diagonal sign matrix. With 3 or more rounds, this closely approximates a Haar-random orthogonal matrix. The RHT applies in O(d log d) via butterfly operations versus O(d^2) for a full matrix multiply, though for head_dim=128 the difference is negligible.
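A minimal sketch of one RHT pipeline (our own `fwht` butterfly, not the package's implementation): each round is H applied to a random sign flip, and because the normalized Sylvester-ordered Hadamard matrix is symmetric and its own inverse, the whole transform inverts round by round.

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Normalized fast Walsh-Hadamard transform via O(d log d) butterflies.
    Requires len(x) to be a power of two."""
    y = x.copy()
    d = len(y)
    h = 1
    while h < d:
        y = y.reshape(-1, 2, h)
        a = y[:, 0, :].copy()
        y[:, 0, :] = a + y[:, 1, :]
        y[:, 1, :] = a - y[:, 1, :]
        y = y.reshape(d)
        h *= 2
    return y / np.sqrt(d)

def rht(x: np.ndarray, signs: list) -> np.ndarray:
    """Apply H * D_k * ... * H * D_1: one fwht per random sign vector."""
    for s in signs:
        x = fwht(s * x)
    return x

rng = np.random.default_rng(0)
signs = [rng.choice([-1.0, 1.0], size=128) for _ in range(3)]
x = rng.standard_normal(128)
y = rht(x, signs)

# Inverse: normalized H is involutory, so each round undoes as D * H
x_back = y
for s in reversed(signs):
    x_back = s * fwht(x_back)
```

The roundtrip is exact up to floating-point error, and the transform preserves norms, as any orthogonal rotation must.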
Bit Packing
Packing is mechanical but must be correct. Each bit-width has a different packing layout:
- 4-bit: pair values into nibbles, 2 values per byte. head_dim=128 produces 64 bytes.
- 3-bit: group 8 values into 24 bits (3 bytes). head_dim=128 produces 48 bytes.
- 2-bit: group 4 values into 1 byte. head_dim=128 produces 32 bytes.
The 3-bit case is the least obvious. We pack 8 three-bit indices into a 24-bit integer, then split it into 3 bytes:
def pack_3bit(indices: torch.Tensor) -> torch.Tensor:
    D = indices.shape[-1]  # must be a multiple of 8
    groups = D // 8
    idx = indices.to(torch.int32).reshape(*indices.shape[:-1], groups, 8)
    bits24 = torch.zeros(*idx.shape[:-1], dtype=torch.int32, device=indices.device)
    for i in range(8):
        bits24 = bits24 | ((idx[..., i] & 0x7) << (21 - i * 3))
    byte0 = ((bits24 >> 16) & 0xFF).to(torch.uint8)
    byte1 = ((bits24 >> 8) & 0xFF).to(torch.uint8)
    byte2 = (bits24 & 0xFF).to(torch.uint8)
    packed = torch.stack([byte0, byte1, byte2], dim=-1)
    return packed.reshape(*indices.shape[:-1], groups * 3)
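For clarity, here is the same bit layout together with its inverse, flattened to one dimension in NumPy (hypothetical `*_np` helpers mirroring the torch code, not the package API):

```python
import numpy as np

def pack_3bit_np(indices: np.ndarray) -> np.ndarray:
    """Pack groups of 8 three-bit indices into 24-bit words, split into 3 bytes."""
    idx = indices.reshape(-1, 8).astype(np.uint32)
    bits24 = np.zeros(idx.shape[0], dtype=np.uint32)
    for i in range(8):
        bits24 |= (idx[:, i] & 0x7) << (21 - i * 3)
    out = np.stack([(bits24 >> 16) & 0xFF, (bits24 >> 8) & 0xFF, bits24 & 0xFF],
                   axis=-1)
    return out.astype(np.uint8).reshape(-1)

def unpack_3bit_np(packed: np.ndarray) -> np.ndarray:
    """Inverse: rebuild each 24-bit word, then shift the 3-bit fields back out."""
    b = packed.reshape(-1, 3).astype(np.uint32)
    bits24 = (b[:, 0] << 16) | (b[:, 1] << 8) | b[:, 2]
    cols = [(bits24 >> (21 - i * 3)) & 0x7 for i in range(8)]
    return np.stack(cols, axis=-1).reshape(-1).astype(np.uint8)

idx = np.random.default_rng(0).integers(0, 8, size=128, dtype=np.uint8)
packed = pack_3bit_np(idx)   # 128 three-bit indices -> 48 bytes
```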
All packing operations handle arbitrary batch dimensions. This matters because KV cache tensors come in shapes like [batch, seq, heads, dim], and we need to pack along the last axis without destroying the batch structure.
Triton Kernels
For the GPU path, we wrote fused Triton kernels that combine quantization and packing into a single pass, avoiding intermediate tensor allocation. Here is the 4-bit quantize-and-pack kernel:
@triton.jit
def _quantize_pack_4bit_kernel(
    y_ptr, boundaries_ptr, packed_ptr, N,
    D: tl.constexpr, HALF_D: tl.constexpr,
):
    pid = tl.program_id(0)
    if pid >= N:
        return
    pair_offs = tl.arange(0, HALF_D)
    base = pid * D
    y_even = tl.load(y_ptr + base + pair_offs * 2)
    y_odd = tl.load(y_ptr + base + pair_offs * 2 + 1)
    # Vectorized searchsorted: count boundaries exceeded
    idx_even = tl.zeros([HALF_D], dtype=tl.int32)
    idx_odd = tl.zeros([HALF_D], dtype=tl.int32)
    for b in tl.static_range(15):
        boundary = tl.load(boundaries_ptr + b)
        idx_even += (y_even >= boundary).to(tl.int32)
        idx_odd += (y_odd >= boundary).to(tl.int32)
    packed = ((idx_even & 0xF) << 4) | (idx_odd & 0xF)
    tl.store(packed_ptr + pid * HALF_D + pair_offs, packed.to(tl.uint8))
One program per vector. It loads the even and odd coordinates, runs the searchsorted as a linear scan over 15 boundaries (faster than a binary search at this width), and packs both nibbles into one byte. The dequantize kernel does the reverse: unpack nibbles, gather from the codebook, store interleaved.
We have fused kernels for 2-bit and 4-bit. The 3-bit Triton kernel is a stub that falls back to the PyTorch path. Bit-packing 8 three-bit values into 3 bytes inside Triton's programming model is doable but complex, and the 2-bit and 4-bit widths cover the practical sweet spots.
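The unpack-and-gather direction is simple enough to mirror outside Triton. A NumPy sketch of the 4-bit layout (even coordinate in the high nibble, odd in the low, matching the kernel above; `dequantize_4bit_np` is a hypothetical name, not the package's PyTorch path):

```python
import numpy as np

def dequantize_4bit_np(packed: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Unpack nibbles, gather centroids, store interleaved (even/odd)."""
    idx_even = (packed >> 4) & 0xF        # high nibble: even coordinate
    idx_odd = packed & 0xF                # low nibble: odd coordinate
    out = np.empty(packed.shape[0] * 2, dtype=centroids.dtype)
    out[0::2] = centroids[idx_even]
    out[1::2] = centroids[idx_odd]
    return out

centroids = np.arange(16, dtype=np.float32)      # stand-in codebook
packed = np.array([0x12, 0xAB], dtype=np.uint8)  # pairs (1,2) and (10,11)
vals = dequantize_4bit_np(packed, centroids)     # -> [1., 2., 10., 11.]
```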
The Main Quantizer
The TurboQuant class ties everything together:
tq = TurboQuant(head_dim=128, bits=4, device="cuda")
packed, norms = tq.encode(kv_vectors) # [..., 128] -> [..., 64] uint8 + [...] f32
decoded = tq.decode(packed, norms) # [..., 64] uint8 + [...] f32 -> [..., 128] f16
It automatically selects the Triton path when CUDA is available and falls back to pure PyTorch otherwise. It carries the rotation matrix, codebook, and packing dispatch as instance state, so encoding and decoding are stateless operations on the input tensors.
Validation: Does the Math Hold?
This is where we stop trusting the paper and start trusting measurements. We ran encode-decode roundtrips on 10,000 random unit vectors at each bit-width and compared our measured MSE against the paper's theoretical bounds.
The theoretical framework: for b-bit quantization of vectors on S^{d-1}, the MSE lower bound is 1/4^b (information-theoretic), and the paper proves an upper bound of (3*pi/2)/4^b. The ratio between achieved MSE and the lower bound should be at most 2.7x.
| Bits | Measured MSE | Theory Lower | Theory Upper | Ratio to Lower | Paper Claim |
|---|---|---|---|---|---|
| 4-bit | 0.0093 | 0.0039 | 0.0184 | 2.39x | at most 2.7x |
| 3-bit | 0.0340 | 0.0156 | 0.0736 | 2.17x | at most 2.7x |
| 2-bit | 0.1161 | 0.0625 | 0.2945 | 1.86x | at most 2.7x |
Every bit-width falls comfortably within the claimed bounds. The ratios actually improve at lower bit counts, which aligns with theory: the Lloyd-Max quantizer is closer to optimal when there are fewer levels.
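The measurement is easy to reproduce at small scale. A sketch for the 2-bit case, using the codebook values from earlier (our test harness runs the same check across all widths with larger samples):

```python
import numpy as np

d, n = 128, 2000
rng = np.random.default_rng(0)
Q, R = np.linalg.qr(rng.standard_normal((d, d)))
Q *= np.sign(np.diag(R))                           # shared random rotation

# 2-bit Lloyd-Max codebook for N(0,1), scaled to unit-sphere coordinates
c = np.array([-1.510469, -0.452781, 0.452781, 1.510469]) / np.sqrt(d)
bounds = (c[:-1] + c[1:]) / 2

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # unit vectors on S^{d-1}
Y = X @ Q.T                                        # rotate
X_hat = c[np.searchsorted(bounds, Y)] @ Q          # quantize, inverse-rotate
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))

lower, upper = 1 / 4**2, (3 * np.pi / 2) / 4**2    # b = 2 bits
```

The measured MSE lands around 0.116, squarely between the two theoretical bounds and matching the 2-bit row of the table.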
Compression ratios versus FP16 (each 128-dim vector is 256 bytes at FP16):
| Bits | Packed Bytes | + Norm (f32) | Total | Compression |
|---|---|---|---|---|
| 4-bit | 64 | 4 | 68 | 3.8x |
| 3-bit | 48 | 4 | 52 | 4.9x |
| 2-bit | 32 | 4 | 36 | 7.1x |
The norm overhead (4 bytes per vector for a float32 scalar) is the tax you pay for data-obliviousness. A calibration-based quantizer could absorb the norm into the codebook, but then you need calibration data and lose the online property.
The Test Suite
92 tests in total: 87 pass on our Windows development host (CPU path), and the 5 GPU-only tests skip cleanly. The tests cover:
- Codebook correctness: centroids are sorted, symmetric, correctly scaled by dimension
- Packing roundtrips: pack-then-unpack recovers exact indices for all three bit-widths
- Rotation orthogonality: Q^T * Q = I to within floating-point tolerance
- MSE bounds: measured distortion is within theoretical range
- Compression ratios: byte counts match expectations
- KV cache store/decompress roundtrips
- Multi-layer cache operations
- Block copy (for beam search / copy-on-write)
- Memory estimation accuracy
The KV Cache Manager
With the core quantizer validated, we built a full KV cache manager targeting vLLM's paged attention interface:
cache = TurboQuantKVCache(
    num_blocks=2500, block_size=16,
    num_kv_heads=8, head_dim=128,
    num_layers=32, bits=4,
)
# Per-layer in model forward:
cache.store(layer_idx, keys, values, slot_mapping)
key_cache, value_cache = cache.decompress_layer(layer_idx)
# Feed key_cache, value_cache to standard paged attention
The memory layout is split into two regions:
- Permanent compressed storage: all layers, all blocks, stored as packed uint8 indices plus float32 norms
- Reusable decompression buffer: one layer's worth of FP16 KV data, shared across all layers
The buffer is the trick that makes this practical. You only need one layer decompressed at a time during the forward pass, so you allocate a single buffer and reuse it as you walk through the 32 layers. The buffer cost is amortized across all tokens and is small relative to the compressed storage.
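As a sanity check, here is the arithmetic for the two regions under the constructor's example config (`tq_cache_bytes` is a hypothetical helper, assuming float32 norms and an FP16 buffer, as described above):

```python
def tq_cache_bytes(num_layers: int, num_blocks: int, block_size: int,
                   num_kv_heads: int, head_dim: int, bits: int):
    """Rough sizes of the two memory regions: all-layer compressed storage
    and the single-layer FP16 decompression buffer."""
    packed_dim = head_dim * bits // 8              # packed bytes per vector
    per_vec = packed_dim + 4                       # plus one float32 norm
    vecs = num_blocks * block_size * num_kv_heads  # vectors per layer (per K or V)
    compressed = num_layers * vecs * per_vec * 2   # K and V, every layer
    buffer = vecs * head_dim * 2 * 2               # one layer, FP16, K and V
    return compressed, buffer

compressed, buffer = tq_cache_bytes(32, 2500, 16, 8, 128, 4)
```

With this config the compressed store is about 1.39 GB while the shared buffer is about 164 MB, which is why reusing one buffer across all 32 layers is cheap.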
The cache manager also provides selective decompression (decompress_blocks) for partially-filled caches, and block copy operations for beam search and copy-on-write -- both operating directly on the compressed representation without decompressing first.
Wiring It Into vLLM: The Hard Part
Our production setup: Nemotron-Orchestrator-8B on an RTX 5090 (32 GB VRAM), vLLM v0.15.1 with FlashInfer attention backend, gpu_memory_utilization=0.45, FP8 KV cache. This gives us 118,112 tokens of KV cache capacity.
The goal: make vLLM allocate blocks based on TQ-compressed page sizes instead of FP8 sizes, so it creates more blocks for the same VRAM budget.
vLLM v0.15 uses a multi-process architecture: the API server runs in pid 1, and the EngineCore workers run in separate processes (pid 115, 119, etc.). Both processes import vLLM's KV cache specification classes, and both need to be patched.
We solved this with a sitecustomize.py import hook:
# tq_sitecustomize.py -- placed on PYTHONPATH
# Patches vLLM BEFORE any module loads, in every process
import importlib
import sys
class TQImportFinder:
    def find_module(self, name, path=None):
        if name == "vllm.v1.kv_cache_interface":
            return self
        return None

    def load_module(self, name):
        # Load the real module first
        real_module = importlib.import_module(name)
        # Then patch it
        apply_tq_patches(real_module)
        return real_module

sys.meta_path.insert(0, TQImportFinder())
The hook intercepts vLLM's KV cache interface module at import time and patches real_page_size_bytes on both AttentionSpec and FullAttentionSpec. Both must be patched -- FullAttentionSpec shadows the parent class's property, so patching only AttentionSpec has no effect.
The patched property computes the TQ-compressed page size:
@property
def _tq_real_page_size_bytes(self):
    from lib.gpu.turboquant.packing import packed_size
    pd = packed_size(self.head_size, _TQ_BITS)
    return 2 * self.block_size * self.num_kv_heads * (pd + 4)
For Nemotron-8B with TQ4 (block_size=16, heads=8, head_dim=128): the standard FP8 page size is 32,768 bytes per block. Our TQ4 page size is 17,408 bytes. That ratio -- 1.88x more blocks for the same memory -- is the payoff.
With the patch active, vLLM calculated 222,336 tokens of KV cache capacity. Up from 118,112. An 88% increase, from a property override.
The Reshape Crash
The 222K allocation crashed.
Here is exactly why. When vLLM allocates KV cache, it does this:
# Inside vLLM's cache allocation
raw_bytes = page_size_bytes * num_blocks
kv_cache_tensor = torch.zeros(raw_bytes, dtype=torch.int8, device="cuda")
So far, correct. Our patched page_size_bytes is smaller, so raw_bytes is smaller for the same number of blocks. The total allocation fits in VRAM.
But then vLLM reshapes that raw tensor into the structured format that FlashInfer's CUDA attention kernel expects:
# vLLM's tensor reshape
kv_cache = raw_tensor.view(fp8_dtype).view([2, num_blocks, block_size, num_heads, head_dim])
This reshape expects 2 * num_blocks * block_size * num_heads * head_dim * sizeof(fp8) bytes. The standard shape. Our patched page size made the raw tensor smaller -- correctly sized for TQ storage -- but the reshape still expects the standard FP8 dimensions. The smaller tensor cannot fill the larger shape. The reshape fails with a size mismatch.
The fundamental issue: we told vLLM's allocator to reserve less memory per block, but we did not change the tensor format that the attention kernel reads from. vLLM's allocation path and its consumption path assume the same format, and we only patched the allocation side.
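The failure mode is easy to reproduce in miniature (a NumPy stand-in for the torch view, with 4 blocks instead of thousands):

```python
import numpy as np

num_blocks, block_size, heads, head_dim = 4, 16, 8, 128
fp8_page = 2 * block_size * heads * head_dim    # 32,768 bytes per block
tq4_page = 2 * block_size * heads * (64 + 4)    # 17,408 bytes per block

# Allocator side: reserves the (smaller) TQ page size per block
raw = np.zeros(tq4_page * num_blocks, dtype=np.int8)

# Consumer side: still reshapes to the standard FP8 layout
failed = False
try:
    raw.reshape(2, num_blocks, block_size, heads, head_dim)
except ValueError:
    failed = True   # the smaller buffer cannot fill the larger shape
```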
The Missing Piece
FlashInfer's paged attention kernel reads from [2, num_blocks, block_size, num_heads, head_dim] tensors in FP8 or FP16. It does not understand packed uint8 indices. To serve attention from TQ-compressed storage, we need to close the gap between our compressed format and the kernel's expected input.
There are three paths forward:
Option A: Fused Triton attention kernel. Write a custom attention kernel that reads directly from TQ-compressed storage. The kernel would: (1) unpack indices, (2) look up centroids, (3) apply inverse rotation, (4) rescale by norm, (5) compute attention scores -- all in a single fused pass. We already have the Triton dequantize kernels in triton_ops.py. Fusing them with attention score computation is the remaining work. This is the maximum-efficiency path: zero decompression buffer, zero memory overhead beyond the compressed storage.
Option B: Decompression buffer. Keep FlashInfer's standard attention kernel. Before each layer's attention computation, decompress the TQ cache into a standard-format buffer, run attention, then discard the buffer. This is what our TurboQuantKVCache.decompress_layer() already does. The problem: the buffer must be large enough to hold one full layer of decompressed KV data, and it scales with num_blocks, partially eating the compression savings.
Option C: Upstream contribution. Add TQ as a native kv_cache_dtype in vLLM, alongside FP8 and FP16. This would integrate at the allocation, storage, and attention levels simultaneously. It is the cleanest solution but requires coordination with the vLLM project.
We chose Option A.
The Fused Kernel: Rotated-Domain Attention
The breakthrough insight: you don't need to dequantize the keys at all. Instead of inverse-rotating each K vector, rotate the query forward once and compute dot products in the rotated domain:
score_i = ||k_i|| * (Pi @ q)^T @ y_hat_i / sqrt(d)
This is mathematically equivalent to q^T @ k_i / sqrt(d) but avoids the per-vector 128x128 matmul. The query rotation is a one-time cost. The per-token work is just a codebook lookup and a dot product.
The same trick works for value accumulation:
output = Pi^T @ (sum_i w_i * ||v_i|| * y_hat_v_i)
Accumulate in the rotated domain, rotate back once at the end. Two 128x128 matmuls total, regardless of context length.
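Both identities rest only on Pi being orthogonal, so they can be checked numerically in a few lines (NumPy sketch; `Pi` here is a fresh random rotation, not the production one):

```python
import numpy as np

d = 128
rng = np.random.default_rng(0)
Pi, R = np.linalg.qr(rng.standard_normal((d, d)))
Pi = Pi * np.sign(np.diag(R))                 # random orthogonal rotation

q = rng.standard_normal(d)
k = rng.standard_normal(d)

# Scores: rotating both sides leaves the dot product unchanged, so attention
# can run entirely in the rotated domain against cached (norm, y_hat) pairs.
k_norm = np.linalg.norm(k)
y_hat = Pi @ (k / k_norm)                     # what the cache stores (pre-quantization)
score_rotated = k_norm * (Pi @ q) @ y_hat / np.sqrt(d)
score_direct = q @ k / np.sqrt(d)

# Values: accumulate norm-scaled rotated vectors, rotate back once at the end.
w = rng.random(5)                             # attention weights
V = rng.standard_normal((5, d))
acc = sum(w[i] * np.linalg.norm(V[i]) * (Pi @ (V[i] / np.linalg.norm(V[i])))
          for i in range(5))
out_rotated = Pi.T @ acc
out_direct = (w[:, None] * V).sum(axis=0)
```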
We built a PyTorch reference implementation (TQPagedAttentionRef in fused_attention.py) and validated it with 16 tests:
- 5 math equivalence tests: rotated dot product, norm-scaled scoring, even/odd split, V accumulation, online softmax -- all pass to 1e-5 tolerance
- 6 functional tests: single block, multi-block, partial last block, zero context, shape validation, fused-vs-decompress comparison -- all pass with >0.99 cosine similarity
- 5 numeric tests: large norms, single token, 2-bit/3-bit/4-bit bit-widths -- all pass
The fused kernel reads directly from packed[block_size, num_kv_heads, packed_dim] uint8 tensors and norms[block_size, num_kv_heads] float32 tensors. No decompression buffer.
The Architectural Insight
The original Triton kernel stub tried to do everything inside the kernel: rotation, even/odd splitting, codebook gather, attention, inverse rotation. It hit Triton's register limitations: you cannot index a register tensor by a computed value (q_rot[i]), and even/odd slicing (q_rot[0::2]) is unreliable across Triton versions.
The fix is architectural, not algorithmic. Move rotation and interleaving outside the kernel into PyTorch, where they are trivial:
Pre-processing (PyTorch):
q_rot = query @ rotation.T # one matmul per batch
q_even = q_rot[..., 0::2] # stride slice
q_odd = q_rot[..., 1::2]
Kernel (Triton):
For each (seq, head): online softmax over compressed KV tokens
- Load [HALF_D] packed uint8
- Codebook gather: tl.load(centroids + nibble_indices)
- Score: k_norm * (dot(q_even, k_even) + dot(q_odd, k_odd)) * scale
- Online softmax + V accumulation in separate even/odd registers
Post-processing (PyTorch):
out_rot[..., 0::2] = acc_even
out_rot[..., 1::2] = acc_odd
output = out_rot @ rotation # inverse rotation
Every register array inside the kernel is [HALF_D=64] shaped. No computed indexing, no stride slicing, no in-kernel matmul. The codebook gather (tl.load(centroids_ptr + indices)) is native Triton. The even/odd split that blocked the original implementation is now a zero-cost stride operation in PyTorch.
Split-K: From 5x Slower to Competitive
The initial fused kernel launched one Triton program per query head -- 32 programs on 170 SMs, 19% GPU utilization. Each program serially processed all KV blocks. At 16K context, that was 537ms for 36 layers.
Split-k fixes this by partitioning KV blocks across 8 parallel programs per head. Phase 1 computes partial softmax results per split; phase 2 merges them via log-sum-exp reduction. Grid goes from (32, 1) to (32, 1, 8) = 256 programs -- saturating the GPU.
Results: 27x faster at 16K context (19.6ms vs 537ms). Per-layer latency of 544us is competitive with FlashAttn for decode. Short contexts (< 2K tokens) use the single-program kernel to avoid split-k overhead.
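The phase-2 merge is the standard log-sum-exp combination of per-split partials. A NumPy sketch of the reduction (8 splits over a 1,024-token context; the names are illustrative, not the kernel's):

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.standard_normal(1024)            # one head's attention scores
values = rng.standard_normal((1024, 64))

# Reference: softmax-weighted sum over the whole context at once
p = np.exp(scores - scores.max())
ref = (p[:, None] * values).sum(axis=0) / p.sum()

# Phase 1: each split reports (local max, exp-sum, weighted partial sum)
partials = []
for chunk in np.split(np.arange(1024), 8):
    s, v = scores[chunk], values[chunk]
    m = s.max()
    e = np.exp(s - m)
    partials.append((m, e.sum(), (e[:, None] * v).sum(axis=0)))

# Phase 2: rescale every split to the global max, then combine
g = max(m for m, _, _ in partials)
denom = sum(np.exp(m - g) * z for m, z, _ in partials)
out = sum(np.exp(m - g) * acc for m, _, acc in partials) / denom
```

Rescaling by exp(m - g) brings every split's partial sums onto a common scale, so the merged result matches a single-pass softmax exactly.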
Full Model Simulation: Nemotron-8B
We simulated a complete 36-layer decode pass matching the orchestrator's exact configuration (32 query heads, 8 KV heads, GQA 4:1, head_dim 128):
| Context | 36-layer decode | Per-layer | TQ4 memory | FP8 memory | Saved |
|---|---|---|---|---|---|
| 256 | 6.7 ms | 186 us | 10 MB | 18 MB | 8 MB |
| 1K | 20 ms | 561 us | 38 MB | 72 MB | 34 MB |
| 2K | 8.5 ms | 235 us | 77 MB | 144 MB | 67 MB |
| 4K | 19.6 ms | 544 us | 153 MB | 288 MB | 135 MB |
| 8K | 19.6 ms | 544 us | 306 MB | 576 MB | 270 MB |
| 16K | 19.6 ms | 544 us | 612 MB | 1,152 MB | 540 MB |
The 2K+ rows use split-k parallelism (8-way block splitting across GPU SMs). The 8K and 16K numbers are identical because our test cache was capped at 256 blocks -- the per-layer latency (544 us) is the meaningful metric. It scales flat because split-k distributes the work.
At a 20 GB VRAM budget for KV cache:
| Format | Max tokens | Concurrent 40K sequences |
|---|---|---|
| FP8 | 291,271 | 7.1x |
| TQ4 | 548,275 | 13.4x |
| Gain | +257,004 | +6.3 extra sequences |
Validation on RTX 5090
We validated the fused kernel against the PyTorch reference implementation across context lengths from 1 to 4,096 tokens:
ctx= 1 cosine=1.000000 maxdiff=0.000001
ctx= 16 cosine=1.000000 maxdiff=0.000122
ctx= 256 cosine=1.000000 maxdiff=0.000000
ctx= 1024 cosine=1.000000 maxdiff=0.000031
ctx= 4096 cosine=1.000000 maxdiff=0.000004
Perfect cosine similarity at every length. The maximum absolute difference (0.000122) stays well within FP16 rounding error.
Wiring the Fused Kernel Into vLLM
With a validated fused kernel in hand, we turned to integrating it into the actual vLLM forward pass. This is where the real engineering begins.
The FlashInfer API Mismatch
When we deployed the fused kernel into the Docker container, we discovered that vLLM v0.15 changed the Attention.forward() signature from what we expected. The KV cache is no longer passed as a function argument -- it's stored as self.kv_cache on the module, and attention metadata comes from a context manager (get_forward_context().attn_metadata), not as a positional argument.
This means patching Attention.forward() requires adapting to the v0.15-specific API where:
- The cache write goes through `torch.ops.vllm.unified_attention` (a registered custom op)
- The attention metadata is implicit, not explicit
- The forward signature is `forward(self, query, key, value, output_shape=None)`
We adapted: instead of patching Attention.forward(), we hook GPUModelRunner.load_model and replace each layer's self.impl (the FlashInfer backend instance) with our _TQAttentionImplWrapper after model load. This matches the correct v0.15 call path: self.impl.forward(layer, query, key, value, kv_cache, attn_metadata, output=output).
The isinstance Problem
vLLM's internals check isinstance(impl, FlashInferImpl) in several places. Replacing the impl object with our wrapper class would fail those checks and crash the attention metadata builder.
The fix: wrap forward() in-place on the original FlashAttn/FlashInfer impl objects, rather than replacing them. This preserves class identity -- isinstance(impl, FlashInferImpl) passes because the object IS still a FlashInferImpl. It just has a monkey-patched forward().
In the latest deployment, all four monkey-patches fire successfully in the EngineCore worker:
- `page_size_bytes` patched (221,808 tokens calculated -- 1.88x over FP8)
- `_reshape_kv_cache_tensors` patched (36 raw TQ tensors returned, standard reshape skipped)
- 36 `Attention.impl` instances replaced with TQ wrappers
- `enforce_eager=True` active (prevents torch.compile from breaking patches)
The CUDA Device-Side Assert
The final hurdle in the initial integration was the TQ encode itself. Running the rotation matmul on CUDA inside the attention forward path triggered device-side asserts that poisoned the entire CUDA context -- killing not just TQ but the subsequent FlashAttn computation too. CUDA device-side asserts are fatal and uncatchable by Python try/except. Once one fires, every subsequent CUDA call in the process returns cudaErrorAssert until the process exits.
The fix: run TQ encode on CPU. Copy K,V to CPU, encode there, store the packed result. This avoids any CUDA context contamination. The CPU encode adds latency but proves the pipeline works end-to-end. With this change, the orchestrator boots, serves inference, and all 36 layers successfully shadow-encode every token:
[TQ] Layer 0: shadow encode OK (16 tokens -> packed torch.Size([128, 64]))
[TQ] Layer 1: shadow encode OK (16 tokens -> packed torch.Size([128, 64]))
...
[TQ] Layer 35: shadow encode OK (16 tokens -> packed torch.Size([128, 64]))
The complete pipeline -- paper math, codebook, rotation, quantize, pack, production forward pass, 36 layers encoding real inference tokens -- is proven end-to-end.
Two-Tier KV Cache: Proving the Architecture
With the encode pipeline proven via CPU fallback, we implemented a two-tier KV cache architecture as the stepping stone to the full fused path:
- Hot tier: Standard FP8 cache managed by vLLM (118K tokens, VRAM)
- Cold tier: TQ4-compressed CPU storage via `TieredKVCache` (unlimited capacity, system RAM)
Every token written to the FP8 cache is simultaneously TQ-compressed and stored in the cold tier. FlashAttn handles all attention computation on the hot tier. The cold tier is ready for block retrieval when needed: tiered.warm_blocks(layer_idx, block_indices, hot_cache) decompresses selected blocks from TQ storage back into the FP8 cache.
Request -> FlashAttn writes K,V to FP8 hot cache
-> FlashAttn computes attention output
-> TieredKVCache.compress_new() copies K,V to CPU, TQ-encodes, stores
-> (later) TieredKVCache.warm_blocks() decompresses back when needed
This architecture:
- Adds zero latency to the attention path (TQ encode runs AFTER FlashAttn completes)
- Protects CUDA context (CPU-side encode can't trigger device-side asserts)
- Preserves exact FlashAttn output (hot tier data is untouched by TQ)
- Enables future block eviction (cold tier already holds compressed copies of all blocks)
In production logs:
[TQ] L0: cold tier encode OK (16 tokens)
[TQ] L1: cold tier encode OK (16 tokens)
...
[TQ] L35: cold tier encode OK (16 tokens)
All 36 layers actively encoding to the cold tier during live inference. The orchestrator serves requests correctly with TQ enabled (AITHER_TQ_BITS=4).
While building the TQ cold tier, we realized the orchestrator's gpu_memory_utilization was set to 0.45 -- a holdover from when the reasoning model shared the local GPU. The reasoning model moved to cloud GPUs months ago. The orchestrator had 11 GB of VRAM sitting idle.
We bumped gpu_memory_utilization to 0.80:
BEFORE: 118,112 tokens (util=0.45, FP8, 2.88x concurrency)
AFTER: 280,144 tokens (util=0.80, FP8 + TQ cold tier, 6.84x concurrency)
280K tokens. 6.84 concurrent full-context requests. The TQ cold tier runs in parallel, async-compressing every token to CPU storage via a background thread with zero-sync GPU-to-CPU transfers.
3-Tier KV Cache: VRAM + DDR5 + Recompute
With the fused kernel validated and the two-tier architecture proven, we built the full 3-tier KV cache that exploits the 128 GB of DDR5 system RAM alongside the 32 GB RTX 5090.
Tier 1 -- VRAM (fused Triton decode): TQ4-compressed KV cache on GPU. The fused kernel reads it directly, no decompression. 548K tokens in 20 GB.
Tier 2 -- DDR5 (cold tier, bulk transfer): TQ4-compressed blocks in pinned system RAM. Same format as VRAM -- block warming is a raw memcpy, not a decompress. At roughly 38 KB per token (36 layers, K and V), 128 GB holds about 3.5 million tokens.
Tier 3 -- Recompute (cache miss): used only when both tiers miss (prefix never seen before).
Storage Layout: 5D Contiguous Tensors
The key to fast spill/warm is the storage layout. Both VRAM and DDR5 use contiguous 5D tensors:
# [num_layers, max_blocks, block_size, num_kv_heads, packed_dim] uint8
# Spill: 4 bulk GPU->CPU copies (all layers in one shot)
cold_k_packed[:, block_indices] = gpu_k_packed[:, block_indices].cpu()
# Warm: 4 bulk CPU->GPU copies (pinned memory, non_blocking)
gpu_k_packed[:, block_indices] = cold_k_packed[:, block_indices].to('cuda', non_blocking=True)
Four tensor copy operations total, regardless of layer count. No per-layer Python loop.
Cold Tier Bandwidth
Measured on RTX 5090 with Nemotron-8B (36 layers, 8 KV heads):
| Blocks | Tokens | Data | Warm time | Bandwidth |
|---|---|---|---|---|
| 64 | 1K | 38 MB | 284 ms | 135 MB/s |
| 256 | 4K | 153 MB | 480 ms | 319 MB/s |
| 1024 | 16K | 612 MB | 571 ms | 1,072 MB/s |
At scale (1024+ blocks), warm bandwidth reaches 1 GB/s over PCIe 4.0. The fused kernel then serves decode attention directly from the warmed TQ blocks -- no decompression step between DDR5 and attention.
The Production Flow
New token -> TQ encode on GPU -> store in VRAM Tier 1
-> async spill to DDR5 Tier 2 (background, non-blocking)
Normal decode -> fused Triton kernel reads VRAM TQ directly
Block eviction -> blocks already in DDR5 (spilled earlier)
-> VRAM space freed for new sequences
Prefix cache hit on evicted sequence -> warm from DDR5 (raw TQ memcpy)
-> fused decode immediately
-> no recompute, no decompress
vLLM Integration
The TurboQuantImpl attention backend registers via vLLM's plugin system and routes automatically:
- Decode (`max_query_len == 1`): Fused TQ kernel on VRAM-resident compressed data
- Prefill (`max_query_len > 1`): Standard Triton attention on FP8 vLLM cache
- Both paths: Encode new K,V to TQ GPU storage for future decode use
The backend is activated with --attention-backend CUSTOM (vLLM CLI flag) and AITHER_TQ_FUSED=1. The env var VLLM_ATTENTION_BACKEND is not read by vLLM v0.15.1's V1 engine -- only the CLI flag works. It falls through to standard Triton on any error, so there is no risk of degraded inference.
PRIMARY Mode: Killing the Decompression Bottleneck
SHADOW mode keeps the standard FP8 cache as the primary store and writes a compressed copy on the side. PRIMARY mode (AITHER_TQ_MODE=tq4-primary) goes further: the vLLM KV cache tensor IS the TQ cache. uint8 layout, 3.8x more blocks, 309K+ tokens addressable. But the original PRIMARY decode path had a problem.
The first implementation used batch decompression: gather all active TQ blocks, call tq.decode() to expand them into a temporary bf16 buffer, then call unified_attention on the decompressed data. This meant decompressing every active block on every decode step, for every layer. At 128 blocks and 36 layers, that is 576K vector decompressions per generated token. The result: 8-12 tok/s. The decompression was the bottleneck, not the attention.
The fix was to use the same fused TQPagedAttention kernel that SHADOW mode already had. The kernel reads packed uint8 indices and f32 norms directly from the primary cache, unpacks nibbles in-register during attention (codebook lookup + rotation in-kernel), and never materializes a decompression buffer. Zero intermediate memory. The key change in _forward_primary_decode():
# Before: decompress everything, then attend
self._batch_decompress_active_blocks(...)   # 576K decompressions/token
unified_attention(decompressed_buf, ...)

# After: fused kernel reads compressed data directly
self._ensure_fused_attn()                   # TQPagedAttention from SHADOW mode
fused_out = self._fused_attn.forward(
    query=query.to(torch.float32),
    k_packed=key_cache[:, :, :, :packed_dim],  # raw uint8
    k_norms=primary_k_norms[layer_idx],        # f32
    ...
)
The [:, :, :, :packed_dim] slice produces a non-contiguous view (stride mismatch from the 4-byte padding in tq_dim=68 vs packed_dim=64). The Triton kernel handles this via explicit stride-based pointer arithmetic -- no .contiguous() copy needed.
The old batch-decompress path is preserved as _forward_primary_decode_slow() -- a fallback that triggers only if the fused kernel fails to initialize. In production it never fires. Single-request throughput stays at ~12.8 tok/s (the model's MLP layers dominate at short context), but the fused path eliminates the decompression VRAM allocation entirely and scales better at long contexts where the old path's O(active_blocks * layers) decompress cost was the wall.
Production on Blackwell: 224 tok/s
This is the part where everything comes together.
The RTX 5090 is NVIDIA's Blackwell consumer GPU -- SM_120, compute capability 12.0. Triton kernels compiled for Ampere/Hopper (SM_80/SM_90) don't automatically work on Blackwell. The SM_100+ instruction set introduces new register file layouts and warp scheduling. Our initial guard conservatively disabled Triton on SM_100+ and fell back to the PyTorch reference implementation -- a pure Python nested loop that runs at 0.5 tok/s. Unusable.
We added AITHER_TQ_FORCE_TRITON=1 to override the Blackwell guard and test the Triton kernel directly on SM_120. Result: it works. Triton's compiler generates valid PTX for Blackwell without modification. The split-k attention kernel, the quantize/dequantize ops, the paged cache lookups -- all functional on SM_120.
With the fused kernel running on Blackwell and eager mode, we measured 26.1 tok/s at 5 concurrent requests. Good, but not the final number. The --enforce-eager flag was still active -- a leftover from early TQ development, when the custom attention backend couldn't survive CUDA graph capture.
The CUDA Graph Problem
CUDA graph capture calls forward() to record the computation graph, then replays the recorded graph for subsequent calls with the same input shapes. TQ's forward has dynamic control flow -- cache initialization, conditional encode, decode routing based on max_query_len. CUDA graphs cannot record branching. The capture fails with cudaErrorStreamCaptureInvalidated.
The fix is a single check at the top of forward():
```python
if torch.cuda.is_current_stream_capturing():
    return super().forward(...)  # pure delegation during graph recording
```
During graph capture, TQ is invisible -- vLLM records the standard Triton attention path. During live inference (outside graph replay), TQ runs its full encode + fused decode pipeline. The two modes coexist because CUDA graphs replay the captured path for batch sizes that were captured, and fall through to eager mode for sizes that weren't.
Three Bugs Found During Live Validation
The Blackwell deployment exposed three bugs that only manifested under production conditions:
1. Shadow cache gating: TQ cache initialization was gated behind _fused_enabled, preventing encode in shadow-only mode (FUSED=0). Shadow cache must encode regardless -- only the decode kernel is gated.
2. Environment passthrough: AITHER_TQ_FORCE_TRITON and AITHER_TQ_EAGER were in .env but not in the Docker Compose environment: section, so they never reached the container. Docker Compose only passes env vars explicitly listed in environment: or env_file:. Fix: both are now in the compose environment: block with defaults (AITHER_TQ_FORCE_TRITON=1, AITHER_TQ_EAGER=0).
3. OOM on cache allocation: TQ cache was allocated using 20% of free VRAM (up to 2.5 GiB), but vLLM's internal memory pool doesn't account for external CUDA allocations. Capped at 512 MiB with a hard try/except -- TQ cache allocation can never crash the EngineCore subprocess.
The Numbers
With AITHER_TQ_EAGER=0 enabling torch.compile and CUDA graphs:
| Mode | 5-concurrent tok/s | Improvement |
|---|---|---|
| --enforce-eager (before) | 26.1 | baseline |
| torch.compile + CUDA graphs | 64.0 | +145% |
| Graphable fused decode (v0.7.0) | 87.9 | +237% |
| v0.8.0 optimizations (rotation cache, buffer reuse, split-k 64/16) | 224.1 | +758% |
| Metric | Value |
|---|---|
| Architecture | RTX 5090 (SM_120, Blackwell) |
| Model | Nemotron-Orchestrator-8B-AWQ-4bit |
| Attention backend | CUSTOM (TurboQuantImpl) |
| Decode throughput (1 req, cold) | 13.5 tok/s |
| Decode throughput (1 req, hot cache) | 15.8 tok/s |
| Decode throughput (5 concurrent) | 224.1 tok/s |
| Decode throughput (10 concurrent) | 288.6 tok/s |
| Decode throughput (20 concurrent) | 115.2 tok/s (GPU saturated) |
| Max generation (cold, 4096 tok limit) | 2,666 tokens generated |
| Max generation (hot, 4096 tok limit) | 3,895 tokens generated |
| Hot cache speedup (single req) | 12-17% prefix cache savings |
| TQ cache | 36 layers x 856 blocks @ TQ4 |
| TQ VRAM overhead | 512 MB GPU + 512 MB DDR5 |
| Total VRAM | 22.9 / 32.6 GiB (9.3 GiB free) |
| Runtime errors | 0 (stress tested, 20x concurrent) |
The effective context budget with the fused kernel live and the 3-tier cache operational:
| Tier | Storage | Format | Capacity |
|---|---|---|---|
| Hot (VRAM) | 20 GiB KV budget | TQ4 | ~280K tokens |
| Cold (DDR5) | 128 GiB system RAM | TQ4 | ~3.9M tokens |
| Recompute | -- | -- | Unlimited (cache miss penalty) |
280K tokens hot in VRAM. 3.9 million tokens addressable from DDR5 with sub-millisecond block warming. 224 tokens per second of fused decode at 5 concurrent requests. On a single consumer GPU.
Hybrid TQ35: Better Quality, Same Throughput
The same day we validated Blackwell, we integrated a hybrid bit-width quantizer inspired by mitkox/vllm-turboquant. Instead of uniform 4-bit across all dimensions, TQ35 splits each head into two groups:
- Outlier group (50% of dims, high variance): 3-bit MSE codebook + 1-bit QJL residual
- Regular group (50% of dims, low variance): 2-bit MSE codebook + 1-bit QJL residual
The QJL residual is the key innovation: after MSE quantization, the quantization error is projected through a separate structured Hadamard transform and stored as 1-bit sign bits. This captures the direction of the error vector at almost zero cost -- one bit per dimension, using the Johnson-Lindenstrauss property that random projections preserve inner products in expectation.
The result: 3.5 average bits per value, same 64 bytes per vector as TQ4, but measurably better reconstruction quality because the bit budget is allocated where it matters most.
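A toy version of the sign-residual idea (an illustration, not the production code: a random orthogonal projection stands in for the structured Hadamard transform, and the 2-bit Lloyd-Max codebook for a unit Gaussian uses the standard tabulated centroids):

```python
import torch

torch.manual_seed(0)
d = 64
x = torch.randn(d)

# Coarse 2-bit MSE quantization (Lloyd-Max centroids for a unit Gaussian)
centroids = torch.tensor([-1.510, -0.4528, 0.4528, 1.510])
idx = torch.argmin((x[:, None] - centroids[None, :]).abs(), dim=1)
x_hat = centroids[idx]

# 1-bit residual: project the quantization error through a random orthogonal
# matrix (stand-in for the Hadamard transform) and keep only the signs.
r = x - x_hat
H = torch.linalg.qr(torch.randn(d, d)).Q      # random orthogonal matrix
z = H @ r
alpha = z.abs().mean()                         # optimal per-vector scale
r_hat = H.T @ (alpha * z.sign())               # decode: 1 bit/dim + 1 scalar

mse_plain = (x - x_hat).pow(2).mean()
mse_qjl = (x - (x_hat + r_hat)).pow(2).mean()
assert mse_qjl < mse_plain   # the sign bits recover part of the error energy
```

For Gaussian-like residuals the optimal scale is the mean absolute value of the projected error, which cuts the residual MSE to roughly (1 - 2/pi) of its original value, all for one extra bit per dimension plus one scalar.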
| Mode | Avg Bits | Bytes/Vector | vs FP16 | Quality |
|---|---|---|---|---|
| TQ4 (uniform) | 4.0 | 68 | 3.8x | Good |
| TQ35 (hybrid) | 3.5 | 64 | 4.0x | Better (QJL residual) |
| TQ25 (hybrid) | 2.5 | 44 | 5.8x | Acceptable |
In production, TQ35 allocated 910 blocks versus TQ4's 856 -- 6% more capacity in the same 512 MiB VRAM budget. Throughput is comparable to TQ4. Zero encode errors. Set AITHER_TQ_MODE=tq35 to enable.
Cloud Reasoning: TQ4 on a $0.12/hr RTX 3090
The local orchestrator was not the only model starved for context. Our deep reasoning model -- DeepSeek-R1-distill-Qwen-14B-AWQ running on a Vast.ai RTX 3090 -- was configured for 32,768 tokens of context. It was getting 8,192.
The 14B AWQ model occupies roughly 8 GB of VRAM. On a 24 GB RTX 3090 at 90% utilization, that leaves approximately 13.6 GB for KV cache. With FP16 (the default kv_cache_dtype), each token costs 2 (K+V) x 8 (KV heads) x 128 (head dim) x 2 bytes x 48 layers = 196,608 bytes. Divide 13.6 GB by 192 KB per token and you get around 72K tokens. Plenty of room. So why 8,192?
vLLM v0.17 auto-reduces max_model_len when the profiler determines that the requested context length cannot fit alongside the model, CUDA overhead, and activation memory. The profiler runs with the standard FP16 page size. At standard sizing, 32K tokens of KV cache demands 32768 * 196608 = 6.1 GB per concurrent sequence, and at 16 max sequences, that far exceeds the budget. vLLM silently capped to 8,192.
TQ4 changes the arithmetic. Each page drops from 65,536 bytes to 17,408 bytes -- 3.76x smaller. The profiler, seeing the TQ-patched page size, allocates 3.76x more blocks and accepts the full 32K context.
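The page arithmetic can be checked directly (a sketch; the 16-token block size is vLLM's default, and the TQ4 layout assumes 64 packed bytes plus a 4-byte f32 norm per vector):

```python
# Per-page KV bytes for one layer: 2 tensors (K and V) x KV heads x per-vector
# bytes x tokens per block.
kv_heads, head_dim, block_tokens = 8, 128, 16

fp16_page = 2 * kv_heads * head_dim * 2 * block_tokens         # 2 bytes/value
tq4_page = 2 * kv_heads * (head_dim // 2 + 4) * block_tokens   # 4-bit packed + norm

print(fp16_page)              # 65536
print(tq4_page)               # 17408
print(fp16_page / tq4_page)   # ~3.76x more blocks in the same budget
```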
Deployment
The reasoning model runs on Vast.ai -- a rented RTX 3090 accessed via SSH. To deploy TurboQuant:
1. Tar the lib/gpu/turboquant/ library and the tq-vllm-server.py launcher (72 KB total)
2. SCP the tarball to the instance
3. Kill vanilla vLLM and restart with the TQ wrapper
```shell
# On the Vast.ai instance:
export AITHER_TQ_BITS=4 AITHEROS_ROOT=/app PYTHONPATH=/app:/app/lib
python3 tq-vllm-server.py \
  --model casperhansen/deepseek-r1-distill-qwen-14b-awq \
  --served-model-name deepseek-r1:14b \
  --max-num-seqs 16 --max-model-len 32768 \
  --gpu-memory-utilization 0.90 --kv-cache-dtype fp8_e4m3 \
  --port 8000 --trust-remote-code
```
Startup logs:
```
TurboQuant 4-bit KV Cache Compression
arXiv:2504.19874 (Zandieh et al., 2025)
Applying TurboQuant 4-bit patches to vLLM v1...
TQ4 page: 17,408 bytes/block (vs 65,536 standard, 3.8x more blocks)
Patch page_size : OK
Patch max_memory : OK
TurboQuant 4-bit patches active
GPU KV cache size: 126,240 tokens
Maximum concurrency for 32,768 tokens per request: 3.85x
```
Results
| Metric | Before (FP16) | After (TQ4) |
|---|---|---|
| max_model_len | 8,192 | 32,768 |
| KV cache tokens | ~25K | 126,240 |
| Concurrent 32K requests | 0.77x | 3.85x |
| Page size per block | 65,536 B | 17,408 B |
The reasoning model now serves its full configured context. Chain-of-thought inference -- the <think> blocks that DeepSeek-R1 produces -- routinely generates 1K+ tokens of internal reasoning. At the old 8K limit, complex multi-step problems were being silently truncated. At 32K, the model has room to think.
Automated Cloud Deployment
We updated the cloud provisioning pipeline so future reasoning instances automatically get TurboQuant. The cloud_node_profiles.yaml reasoning profile now includes:
```yaml
turboquant:
  enabled: true
  bits: 4
  deploy_files:
    - lib/gpu/turboquant/
    - scripts/tq-vllm-server.py
```
When deploy_from_profile("reasoning") provisions a new Vast.ai instance, it SCPs the TurboQuant files to the instance and restarts vLLM with the TQ wrapper. The same 72 KB tarball, the same launcher script, the same result: 3.8x more KV cache blocks.
The cloud reasoning endpoint is bridged to the local system via an SSH tunnel (localhost:8176 -> Vast.ai:8000), managed by ReasoningNodeManager which auto-discovers running instances, establishes tunnels, monitors health, and restarts on failure. Docker containers reach it via host.docker.internal:8176.
The Open-Source Package: aither-kvcache
Everything described in this post is available as a pip-installable Python package:
```shell
pip install aither-kvcache           # Core library (PyTorch, NumPy)
pip install aither-kvcache[triton]   # + Triton GPU kernels
pip install aither-kvcache[vllm]     # + vLLM plugin integration
pip install aither-kvcache[all]      # Everything
```
The package is aither-kvcache, licensed CC-BY-4.0. It provides:
Standalone quantizer -- works anywhere PyTorch runs:
```python
from turboquant import TurboQuant

tq = TurboQuant(head_dim=128, bits=4, device="cuda")
packed, norms = tq.encode(kv_vectors)  # [..., 128] -> [..., 64] uint8 + [...] f32
decoded = tq.decode(packed, norms)     # [..., 64] uint8 + [...] f32 -> [..., 128] f16
print(tq.validate())                   # MSE vs theoretical bounds
print(tq.memory_report(seq_len=40000, num_layers=32, num_kv_heads=8))
```
Native vLLM plugin -- zero-config integration via vLLM's entry_points system:
```toml
# In aither-kvcache's pyproject.toml:
[project.entry-points."vllm.plugins"]
aither_kvcache = "turboquant.vllm.plugin:register"
```
When aither-kvcache[vllm] is installed in the same environment as vLLM, the plugin auto-registers at startup in every process -- API server and engine workers -- with no monkey-patching, no sitecustomize hooks, and no import hacks. vLLM's plugin system loads the entrypoint, which calls register_backend(AttentionBackendEnum.CUSTOM, "turboquant.vllm.backend.TurboQuantBackend"). The engine page-size patches and KV cache management layer on top.
This is the clean path that replaces the sitecustomize approach described earlier. Install the package, set AITHER_TQ_BITS=4, start vLLM. That is it.
The internal AitherOS integration (lib/gpu/turboquant/) and the published package share the same core code. The internal version adds AitherOS-specific wiring: tiered cache integration, vLLM entrypoint scripts, and the cloud deployment pipeline. The package is the portable, dependency-light version that works with any vLLM installation.
On Blackwell (SM_120), set AITHER_TQ_FORCE_TRITON=1 and AITHER_TQ_EAGER=0 for full speed with torch.compile and CUDA graphs. For split-k tuning, AITHER_TQ_SPLITK_THRESHOLD=64 and AITHER_TQ_SPLITK_SPLITS=16 are the benchmarked defaults for RTX 5090 (170 SMs).
What We Actually Built
Starting from a paper (arXiv:2504.19874) and ending with 589 tok/s aggregate (20 concurrent) fused decode on Blackwell:
| Component | Status | Detail |
|---|---|---|
| TQ core library (codebook, rotation, packing, quantizer) | Production | ~2,200 lines |
| Fused Triton GPU kernels (2-bit, 4-bit quant+pack) | Production | 92+ tests passing |
| Fused TQ decode kernel (rotated-domain, split-k, Triton) | Production | cosine=1.000000 |
| TQGPUCache (5D contiguous VRAM storage) | Production | RTX 5090, 36 layers |
| DDR5 cold tier (bulk spill/warm, 1 GB/s) | Production | pinned memory, 4 ops |
| TurboQuantImpl (fused decode + Triton prefill) | Production | vLLM backend, CUDA graph safe |
| TieredKVCache (async cold tier) | Production | 36 layers active |
| vLLM sitecustomize hooks | Production | Both processes patched |
| Attention impl wrapping (isinstance-safe) | Production | 36 impls wrapped |
| torch.compile + CUDA graph compatibility | Production | 26 -> 64 tok/s |
| Fused Triton KV update kernel | Production | Single launch encode+store |
| Hybrid TQ35/TQ25 quantizer (QJL residual) | Production | 3.5/2.5 avg bits |
| Cloud GPU deployment (Vast.ai auto-provision) | Production | Live on RTX 3090 |
| SSH tunnel automation (ReasoningNodeManager) | Production | Auto-discovery + health |
| aither-kvcache PyPI package | v2.0.0 | pip install aither-kvcache |
| PRIMARY fused decode (no decompress buffer) | Production | Reuses SHADOW kernel on primary cache |
| Graphable fused decode (CUDA graph capture) | Production | 87.9 -> 224.1 tok/s at 5 concurrent |
| Performance optimizations (rotation cache, buffer reuse, split-k 64/16) | Production | 288.6 tok/s peak at 10 concurrent |
| vLLM native plugin (entry_points registration) | Production | Zero-config |
| torch.library.custom_op registration | Production | Zero graph breaks on decode hot path |
| KVCacheGraph (block relationship graph) | Production | 5 edge types, graph-aware eviction |
| GraphEvictionAdvisor (background pre-compute) | Production | Zero lock contention on decode path |
| tq-t4nc deployment (vLLM upstream TQ backend) | Production | 589 tok/s at 20 concurrent |
174 unit tests + 38 integration tests passing. The orchestrator serves inference at 309K token capacity with the 3-tier cache active. The cloud reasoning model serves at 126K tokens with TQ4 page compression. Fused decode runs at 589 tok/s aggregate (20 concurrent) on Blackwell -- validated at 84.5% GSM8K 5-shot accuracy (200 samples). Hybrid TQ35 mode fits 6% more blocks in the same VRAM budget with better reconstruction quality.
What We Learned
The algorithm is sound. Our measured MSE ratios (1.86x to 2.39x of the information-theoretic lower bound) validate the paper's claims with margin to spare. The data-oblivious property is real -- we never ran a calibration step, and the distortion matches theory on random vectors.
The implementation is portable. The pure PyTorch path runs on CPU without modification. The Triton kernels require CUDA. Everything passes on both paths. The codebook is 16 floating-point numbers. The rotation matrix is a one-time 128x128 matrix multiply. The packing is bit shifts. There is nothing expensive here.
The vLLM integration is where the complexity lives. The core algorithm was a weekend project. Making vLLM respect a non-standard KV cache format is a systems engineering problem that touches process boundaries, memory allocation, tensor layout conventions, and CUDA kernel interfaces. The distance between "my quantizer works" and "my quantizer serves production traffic" is significant.
Monkey-patching multi-process Python is fragile but works. The sitecustomize.py approach -- intercepting module imports before they happen, across process boundaries -- is not elegant. But it is the correct strategy for vLLM v0.15's architecture, where both the API server and engine workers independently import the KV cache specification.
CUDA graphs require architectural discipline. You cannot have dynamic control flow inside a function that CUDA graphs will capture. The initial fix (is_current_stream_capturing() bypass) doubled throughput to 64 tok/s. But the real unlock came from making the fused Triton kernel itself graphable -- extracting the hot path into a standalone method with no Python control flow, no lazy init, no try/except. That let CUDA graphs capture the TQ kernel directly instead of falling back to standard attention. Combined with split-k tuning for Blackwell's SM count, it went from 64 to 224 tok/s. The lesson: design for graph capture from the start, and make the custom kernel graphable, not just the fallback.
Test on the target hardware. Three of our five production bugs -- shadow cache gating, env passthrough, OOM on allocation -- only manifested on the RTX 5090 in Docker. Unit tests on CPU caught the math. Integration tests on the actual GPU caught the systems engineering.
TQ is now both a capacity win AND a throughput win. Earlier versions traded throughput for capacity -- TQ PRIMARY ran at 51 tok/s (5 concurrent) vs fp8's 120-150 tok/s. The progression tells the story: v0.8.0 graphable fused decode hit 224 tok/s, v0.9.1 eliminated graph breaks for 40 tok/s single-request, v1.0 registered torch.library.custom_op for zero graph breaks on the decode hot path, and v1.3 with tq-t4nc hit 589 tok/s at 20 concurrent -- surpassing fp8 throughput while maintaining 3.8x memory compression. You get 309K tokens vs ~80K AND faster decode. The turning point was making every TQ operation graphable and tuning split-k parallelism to saturate the GPU.
Profile the right bottleneck. The initial fused decode showed no improvement over batch-decompress at short context because both paths spend <1ms on attention -- the MLP layers dominate. The decompress bottleneck only manifests at 128+ active blocks (2K+ context). The fused path's real win is eliminating the 1.2 GB decompression buffer allocation and scaling linearly with the Triton kernel instead of quadratically with Python-side gather+decode+scatter.
The core library is published as aither-kvcache on PyPI -- pip install aither-kvcache[vllm] and you are done. Codebook, rotation, packing, quantizer, fused attention, Triton kernels, graph-aware eviction. 174+ tests passing. No external dependencies beyond PyTorch. CC-BY-4.0. The paper is Zandieh et al., arXiv:2504.19874.
309K tokens hot on the local orchestrator. 3.9 million tokens addressable from DDR5. 126K tokens on a $0.12/hr cloud reasoning GPU. 589 tokens per second aggregate decode on Blackwell at 20 concurrent. 84.5% GSM8K accuracy. On a single consumer GPU.
Update: v0.8.1 — TQ35-PRIMARY and the CUDA Graph Trap
March 31, 2026
Two developments since v0.8.0 shipped.
TQ35-PRIMARY mode: 329K tokens. The hybrid quantizer (TQ35) was SHADOW-only — it encoded a shadow copy while vLLM's fp8 cache handled attention. That left capacity on the table. v0.8.1 adds tq35-primary: the page_size and reshape patches now handle hybrid layouts (norms embedded in packed data, no separate +4 bytes), so vLLM allocates blocks at the hybrid compressed size. The result is 329,072 tokens on the same 0.55 GPU utilization that gave 282K with TQ4 and 150K with standard fp8. A 4x compression ratio over standard page sizes. The entrypoint accepts AITHER_TQ_MODE=tq35-primary alongside the existing tq4-primary.
The catch: hybrid mode cannot use the fused TQPagedAttention kernel (which requires separate float32 norms). TQ35-PRIMARY falls back to decompress-then-attend, which is correct but slower (~2-5 tok/s single). For throughput-sensitive workloads, TQ4-PRIMARY with the fused kernel remains the better choice. TQ35-PRIMARY is for when you need maximum context window and can tolerate lower throughput — long document ingestion, batch summarization, retrieval-heavy agents.
The CUDA graph diagnosis. After a speculative decoding experiment was added and reverted, TQ4-PRIMARY started producing <think>fff... gibberish. The code was identical to the working baseline. The torch.compile cache was cleared. Yet gibberish persisted. The root cause turned out to be the CUDA graph capture itself — not stale caches, not code changes. Running with AITHER_TQ_EAGER=1 (which sets --enforce-eager, disabling torch.compile and CUDA graphs entirely) immediately produced perfect output: "The capital of France is Paris." at 309K tokens.
The lesson: when CUDA graphs replay compiled graphs that include custom attention backends, the graph capture can bake in tensor states or control flow assumptions that break when the model's attention pattern changes between capture and replay. The TQ fused kernel is pure tensor ops and should be graphable — but something in the capture-replay cycle is corrupting the output. The enforce-eager workaround gives ~6-8 tok/s single request, which is usable while we investigate the graph capture issue. The previous 224 tok/s numbers were with working CUDA graphs, and we expect to recover that throughput once the capture bug is identified.
| Milestone | Status | Key metric |
|---|---|---|
| TQ35-PRIMARY mode | Production | 329K tokens (4x compression) |
| CUDA graph diagnosis | In progress | Eager fallback: 6-8 tok/s |
| Eager-mode TQ4-PRIMARY | Production | 309K tokens, coherent output |
Current recommended configuration: AITHER_TQ_MODE=tq4-primary with AITHER_TQ_EAGER=1. 309K tokens, coherent output, no gibberish. CUDA graph throughput recovery is the next target.
329K tokens hot on a single consumer GPU. The ceiling keeps moving.
Update: v0.9.1 — The Graph Break Massacre (11 → 40 tok/s)
April 2, 2026
The v0.8.1 eager fallback gave us coherent output at 6-8 tok/s single request, climbing to 11 tok/s after the three SDPA bugs were fixed (wrong causal mask, missing GQA expansion, degenerate output from is_causal=True with q_len < kv_len). But 11 tok/s on hardware that does 185 tok/s natively was still unacceptable. Time to profile.
The custom backend was never the right path. vLLM v0.15's custom attention backend registration triggers an Inductor tensor corruption bug at splitting op boundaries. Our previous approach — registering TurboQuantBackend as CUSTOM — required enforce-eager to avoid gibberish. The alternative: don't register a backend at all. Monkey-patch TritonAttentionImpl.forward() to hook TQ encode/decode into the standard Triton backend. vLLM keeps its standard attention code path, torch.compile works normally, and TQ operations run as surgical graph breaks within the compiled forward pass.
The profiling told us exactly where to look. One decode token was taking ~95ms across 36 transformer layers. Each layer's TQ forward created graph breaks — transitions from compiled code to Python and back. The question was how many.
The answer was 144 graph breaks per token. Per layer:
- _tq_init_layer() -- @torch.compiler.disable even though it's a no-op after the first call
- _ensure_quantizer() -- same: no-op after init, still creates a graph break
- _ensure_norms() -- same
- _tq_encode_phase() -- encodes the new K/V token into the TQ cache
- _tq_fused_decode() -- fused attention from the compressed cache
Three of those are doing nothing (after warmup) but still forcing Python entry/exit on every call to every layer. And the two that do real work ran as separate graph breaks, meaning 2 Python transitions for work that should be 1.
Fix 1: Guard the init functions. Replace unconditional calls with Python-level checks (if _tq_quantizer is None: ...). After warmup, these are just Python if statements — no @torch.compiler.disable, no graph break. Eliminated 108 graph breaks per token.
Fix 2: Merge encode + decode. Created _tq_decode_step() — a single @torch.compiler.disable function that calls encode and fused attention together. Instead of exiting compiled code for encode, returning, re-entering compiled code, exiting again for decode, returning — it's one exit and one return. Halved the remaining graph breaks from 72 to 36.
Fix 3: Kill the CPU-GPU syncs. The encode path used .any() to check if slot mappings were valid. In PyTorch, tensor.any() in a Python if triggers an implicit .item() — a CUDA-to-CPU synchronization that blocks the GPU pipeline. Two .any() checks per layer per token = 72 synchronization stalls per generated token, each costing ~3-5 microseconds of pipeline bubble.
For decode, this is absurd. Decode always has exactly 1 token with exactly 1 valid slot. The fix: remove all boolean masking and branching from the encode path. Direct index arithmetic: bi = slot_mapping // block_size, oi = slot_mapping % block_size. No masks, no .any(), no CPU-GPU sync.
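The branchless indexing is a few lines of tensor arithmetic (a sketch with toy shapes and scalar values; the real path scatters packed uint8 vectors into the paged cache):

```python
import torch

block_size = 16
num_blocks = 4
cache = torch.zeros(num_blocks, block_size)

# One slot per decode token; no validity mask, no .any(), no CPU-GPU sync
slot_mapping = torch.tensor([5, 17, 40])   # flat slot ids from vLLM
values = torch.tensor([1.0, 2.0, 3.0])

bi = slot_mapping // block_size            # block index
oi = slot_mapping % block_size             # offset within block
cache[bi, oi] = values                     # direct scatter, stays on-device

assert cache[0, 5] == 1.0 and cache[1, 1] == 2.0 and cache[2, 8] == 3.0
```

Because no tensor value is ever inspected from Python, the whole write enqueues asynchronously and the GPU pipeline never stalls.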
The numbers:
| Metric | Before | After |
|---|---|---|
| Graph breaks per token | 144 | 36 |
| CPU-GPU syncs per token | 72 | 0 |
| L0 merged decode time | 29.34 ms | 1.91 ms |
| Single-request tok/s | 11.0 | 40.1 |
| Wall time for 256 tokens | 24.3s | 6.4s |
3.6x speedup from eliminating overhead, not from faster GPU kernels. The GPU was always fast enough — we were drowning it in Python round-trips.
The remaining 36 graph breaks (one per layer, for the merged encode+fused decode) are load-bearing — the TQ encode uses operations that torch.compile can't trace (dynamic codebook lookups, Triton kernel launches). Eliminating those requires registering TQ operations as torch.library.custom_op, which would make the entire decode path graphable. That is the path to recovering the full 224 tok/s concurrent throughput from v0.8.0 — but 40 tok/s single-request is already serviceable for interactive use.
| Milestone | Status | Key metric |
|---|---|---|
| Hook-based integration (no custom backend) | Production | torch.compile compatible |
| Merged decode step | Production | 144 → 36 graph breaks |
| Branchless encode | Production | 72 → 0 CPU-GPU syncs |
| Single-request throughput | Production | 40 tok/s (was 11) |
| aither-kvcache v0.9.1 | Released | pip install aither-kvcache |
Current recommended: AITHER_TQ_MODE=tq4-primary. No enforce-eager needed -- the hook path doesn't register a custom backend, so torch.compile works normally. 309K tokens, coherent output, 40 tok/s.
The lesson: profile your graph breaks, not your kernels. The GPU was waiting for Python 144 times per token.
Update: v1.0 -- Zero Graph Breaks (custom_op Registration)
April 3, 2026
v0.9.1 left 36 graph breaks per token -- one per layer, for the merged encode+fused decode step. The encode path uses dynamic codebook lookups and Triton kernel launches that torch.compile cannot trace. The solution was always clear: register TQ operations as torch.library.custom_op, making them opaque to the compiler but callable inside compiled graphs.
Two custom ops:
```python
@torch.library.custom_op("tq::decode_step", mutates_args=())
def tq_decode_step(query, k_packed, k_norms, v_packed, v_norms,
                   rotation, centroids, scale, num_blocks, block_size):
    # Fused encode + rotated-domain attention
    ...

@torch.library.custom_op("tq::hybrid_decode_step", mutates_args=())
def tq_hybrid_decode_step(query, k_packed, k_norms, v_packed, v_norms,
                          rotation, centroids_hi, centroids_lo, scale,
                          num_blocks, block_size, split_dim):
    # Same, but for TQ35/TQ25 hybrid layouts
    ...
```
torch.compile treats these as black boxes -- it does not try to trace inside them, does not generate graph breaks around them, and CUDA graphs capture them as opaque kernel launches. The compiler sees a pure function with known input/output shapes and types, which is all it needs to schedule them within the compiled graph.
The result: zero graph breaks on the decode hot path. The entire forward pass -- embedding, RMSNorm, QKV projection, TQ encode, fused attention, output projection, MLP, all 36 layers -- compiles into a single CUDA graph. No Python round-trips during decode.
Both uniform modes (tq4, tq3, tq2) and hybrid modes (tq35, tq25) are graphable. The hybrid op takes separate high/low centroid tables and a split dimension, matching the QJL residual architecture from TQ35.
Alongside the custom ops, we built KVCacheGraph -- a 520-line faculty graph that models KV cache blocks as graph nodes.
KVCacheGraph: Blocks as a Relationship Graph
Standard KV cache eviction is FIFO or LRU -- evict the oldest or least-recently-used block. This ignores structure. A system prompt block shared by 50 concurrent requests is more valuable than a single-use generation block, regardless of access time. A block that is always co-attended with the current working set should be prefetched, not evicted.
KVCacheGraph models five types of relationships between KV blocks:
| Edge type | Meaning | Source |
|---|---|---|
| prefix_share | Block reused across requests (prefix caching) | vLLM block allocator |
| co_attend | Blocks frequently attended together | Attention metadata |
| semantic | Similar key vector embeddings | Embedding distance |
| temporal | Sequential in the same generation | Token position |
| spill_link | Hot VRAM block linked to cold DDR5 copy | Tier cache bridge |
Each block is a KVBlockNode with an importance score, source label (system, user, assistant, tool, generation), token range, and access count. The graph supports:
- Graph-aware eviction: suggest_eviction(n_blocks) returns the least-connected, lowest-importance subgraph. System prompt blocks and blocks with high prefix_share degree are protected.
- Graph-aware prefetch: suggest_prefetch(active_blocks) returns cold-tier blocks that are graph-neighbors of the current working set -- blocks likely to be needed next.
- Cross-session KV reuse: Blocks tagged with prefix_share edges can be identified as reusable across requests without hashing the actual KV data.
The graph is standalone -- no AitherOS dependencies. It ships as part of the aither-kvcache package.
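A toy version of the eviction policy conveys the idea (an illustration, not the package's implementation; BlockNode, the degree threshold, and the scoring are all hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class BlockNode:
    block_id: int
    importance: float
    source: str                                    # system, user, tool, generation
    neighbors: set = field(default_factory=set)    # prefix_share / co_attend edges

def suggest_eviction(nodes, n):
    # Protect system-prompt blocks and highly shared blocks; among the rest,
    # evict the least-connected, lowest-importance blocks first.
    candidates = [b for b in nodes
                  if b.source != "system" and len(b.neighbors) < 3]
    candidates.sort(key=lambda b: (len(b.neighbors), b.importance))
    return [b.block_id for b in candidates[:n]]

blocks = [
    BlockNode(0, 9.0, "system", {1, 2, 3, 4}),   # shared prefix: protected
    BlockNode(1, 0.2, "generation"),
    BlockNode(2, 0.8, "tool", {1}),
    BlockNode(3, 0.1, "generation"),
]
print(suggest_eviction(blocks, 2))  # [3, 1]: loneliest, least important first
```

Contrast with LRU, which would happily evict the shared system-prompt block if it happened to be the oldest.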
| Milestone | Status | Key metric |
|---|---|---|
tq::decode_step custom_op | Production | Zero graph breaks (uniform modes) |
tq::hybrid_decode_step custom_op | Production | Zero graph breaks (hybrid modes) |
| KVCacheGraph (5 edge types) | Production | 520 lines, graph-aware eviction |
| Full CUDA graph capture (all TQ modes) | Production | 7/7 graphs captured |
| RTX 5090 validation | Production | 23.6 tok/s single, 120 tok/s 5x concurrent |
aither-kvcache v1.0 | Released | 174 unit tests + 38 integration tests |
Update: v1.3 -- Graph Eviction in Production (589 tok/s)
April 5, 2026
Two things shipped.
GraphEvictionAdvisor. The KVCacheGraph's suggest_eviction() is O(n log n) in the number of blocks -- fine for batch decisions, too slow for the decode hot path where eviction decisions happen per-token. The advisor runs as a background thread that pre-computes eviction and prefetch rankings every 500ms, storing them as an atomic reference. The decode path reads the pre-computed list with zero lock contention:
```python
advisor = GraphEvictionAdvisor(graph, interval=0.5, eviction_batch=256)
advisor.start()

# Hot decode path -- zero blocking:
candidates = advisor.get_eviction_candidates(n=16)         # atomic read
prefetch = advisor.get_prefetch_candidates(active, n=8)    # atomic read
```
If the ranking is stale (> 2 seconds since last recompute), the advisor returns None and the caller falls back to FIFO. In practice, the background thread finishes in < 50ms per cycle, so staleness never triggers during normal operation.
The advisor auto-loads as a vLLM general plugin via vllm.general_plugins entry point -- install aither-kvcache[vllm] and the graph eviction advisor starts automatically alongside the attention backend. No configuration needed.
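The zero-contention read works because rebinding a Python attribute is atomic under the GIL: the background thread builds a fresh list and swaps the reference, and readers never see a half-built ranking. A minimal sketch of the pattern (toy ranking function, hypothetical names):

```python
import threading
import time

class BackgroundAdvisor:
    def __init__(self, rank_fn, interval=0.5):
        self._rank_fn = rank_fn
        self._interval = interval
        self._ranking = None           # atomic reference: rebound, never mutated
        self._stamp = 0.0
        self._stop = threading.Event()

    def start(self):
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while not self._stop.is_set():
            self._ranking = self._rank_fn()   # build a fresh list, then swap
            self._stamp = time.monotonic()
            self._stop.wait(self._interval)

    def get_candidates(self, n, max_age=2.0):
        ranking = self._ranking               # single atomic read, no lock
        if ranking is None or time.monotonic() - self._stamp > max_age:
            return None                       # stale: caller falls back to FIFO
        return ranking[:n]

advisor = BackgroundAdvisor(lambda: list(range(100)), interval=0.05)
advisor.start()
time.sleep(0.2)
print(advisor.get_candidates(4))  # [0, 1, 2, 3]
```

The staleness check is the safety valve: if the background thread ever falls behind, callers degrade to the simple policy rather than acting on outdated rankings.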
tq-t4nc deployment. We switched from our monkey-patched TQ integration to a cleaner path: vLLM's upstream TurboQuant backend support (originally from PR #38479 in the vLLM repo). The tq-t4nc (TurboQuant for T4/NC -- "native compilation") image installs aither-kvcache==1.3.1 and uses the graph eviction plugin as a general plugin entry point rather than attention backend hooks. The result is a simpler, more maintainable integration that plays nicely with vLLM's internal torch.compile pipeline.
Benchmark: Nemotron-8B-AWQ on RTX 5090
We ran a comprehensive benchmark suite covering throughput, concurrency scaling, and quality:
Throughput (tq4-primary, gpu_memory_utilization=0.42):
| Scenario | Result |
|---|---|
| Single request, cold, 32 tok | 34.1 tok/s |
| Single request, cold, 256 tok | 33.2 tok/s |
| Single request, cold, 1024 tok | 31.7 tok/s |
| Single request, cold, 4096 tok | 38.1 tok/s |
| 5x concurrent, cold, 256 tok | 174.8 tok/s aggregate |
| Median throughput | 37.8 tok/s |
| 20x concurrent | 589 tok/s aggregate |
Quality (GSM8K 5-shot, 200 samples):
| Metric | Value |
|---|---|
| GSM8K accuracy | 84.5% |
| Extraction method | Think-tag aware (extracts from `<think>` blocks) |
| Tool calling validation | 10/10 passed |
84.5% on GSM8K with a 4-bit quantized 8B model running through sub-byte KV cache compression. The quantization noise from TQ4 does not measurably degrade reasoning quality -- the model produces identical chain-of-thought traces whether the KV cache is FP8 or TQ4.
The scaling curve is revealing. Single-request throughput is 32-38 tok/s, bounded by the model's MLP layers (the same bottleneck with or without TQ). At 5x concurrency the GPU starts to saturate on attention -- 175 tok/s aggregate. At 20x we hit 589 tok/s because the GPU is fully batching across all concurrent requests, and the TQ fused kernel's per-token work (codebook lookup + dot product) is trivial compared to the attention computation itself. Above 20x, throughput drops because the KV cache blocks start contending for VRAM bandwidth.
The progression from paper to production:
| Version | Date | Milestone | Key metric |
|---|---|---|---|
| v0.1 | Mar 20 | Core library + Triton kernels | MSE within 2.7x bound |
| v0.5 | Mar 25 | TieredKVCache + vLLM integration | 280K tokens, 26 tok/s |
| v0.7 | Mar 27 | Graphable fused decode | 87.9 tok/s (5 concurrent) |
| v0.8 | Mar 28 | Split-k + rotation cache + buffer reuse | 224 tok/s (5 concurrent) |
| v0.8.1 | Mar 31 | TQ35-PRIMARY hybrid mode | 329K tokens |
| v0.9.1 | Apr 2 | Graph break massacre | 40 tok/s single (was 11) |
| v1.0 | Apr 3 | torch.library.custom_op | Zero graph breaks |
| v1.3 | Apr 5 | Graph eviction + tq-t4nc | 589 tok/s (20 concurrent) |
Current recommended configuration:
```shell
AITHER_TQ_MODE=tq4-primary
AITHER_TQ_FORCE_TRITON=1
AITHER_TQ_EAGER=0
```
Install aither-kvcache[vllm]>=1.3.1 in your vLLM environment. The graph eviction advisor loads automatically. Start vLLM with --attention-backend TRITON_ATTN. 309K tokens, 589 tok/s at 20 concurrent, 84.5% GSM8K accuracy. On a single consumer GPU that costs less than a cloud API call.
309K tokens hot. 3.9 million tokens addressable from DDR5. 589 tokens per second. The ceiling keeps moving.
Update: v2.0.0 -- TriAttention: Spectral KV Compression (9.85x in 26 Bytes)
April 6, 2026
TurboQuant compresses KV vectors with vector quantization -- codebooks, rotations, bit-packing. It is data-oblivious by design: the codebook is fixed, the rotation is fixed, the compression ratio is fixed. That is its strength (no calibration) and its limitation (the compression ratio ceiling is set by the codebook size, not by the data).
TriAttention takes a fundamentally different approach: exploit the mathematical structure of RoPE attention itself.
The Core Insight
Rotary Position Embedding (RoPE) encodes position by rotating key/query pairs through frequency-dependent angles. The attention score between query at position m and key at position n decomposes as:
score(m, n) = sum_{f=1}^{d/2} E_f * cos(theta_f * (m - n) + phi_f)
where theta_f = base^(-2f/d) are the RoPE frequencies, E_f is the energy (amplitude) of pair f, and phi_f is its phase. This is a trigonometric series in the position difference (m - n).
Most of the energy concentrates in a small number of frequency pairs. For a typical Qwen3.5 layer with head_dim=128 (64 frequency pairs), the top 12 pairs carry >95% of the spectral energy. The other 52 pairs contribute noise-level corrections.
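To make "pair energy" concrete, here is a small NumPy sketch that ranks frequency pairs for a single key vector. It assumes the interleaved (even, odd) coordinate pairing with f starting at 0; real RoPE implementations differ in these conventions, so treat it as illustrative:

```python
import numpy as np

def rope_pair_energies(k, base=10000.0):
    """Frequencies theta_f = base^(-2f/d) and per-pair energy of a key vector."""
    d = k.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)  # RoPE frequency table
    pairs = k.reshape(-1, 2)                        # (d/2, 2) coordinate pairs
    energy = np.linalg.norm(pairs, axis=-1)         # E_f per frequency pair
    return theta, energy

rng = np.random.default_rng(0)
k = rng.standard_normal(128).astype(np.float32)     # head_dim=128 -> 64 pairs
theta, E = rope_pair_energies(k)
order = np.argsort(E)[::-1]                         # pairs ranked by energy
frac = (E[order[:12]] ** 2).sum() / (E ** 2).sum()  # energy kept by top 12
```

For a random Gaussian vector the top 12 pairs hold only a modest share of the total energy; the >95% figure reflects the concentration observed in real model keys.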
What TriAttention Stores
Instead of storing the full 128-dimensional key vector (256 bytes in FP16), TriAttention stores only the top-F frequency pairs:
| Component | Per pair | Total (F = 12) |
|---|---|---|
| Frequency index | -- | 12 x 1 byte = 12 B |
| Cosine coefficient | int4 | 12 x 0.5 byte = 6 B |
| Sine coefficient | int4 | 12 x 0.5 byte = 6 B |
| Scale + norm | float16 | 2 B |
| Total per token | -- | 26 B |
Compare: FP16 stores 128 dimensions x 2 bytes = 256 bytes per token. TriAttention at F=12, int4 coefficients: 26 bytes per token. That is a 9.85x compression ratio -- more than double TQ4's 3.8x.
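The byte accounting behind these ratios is simple to verify. A sketch of the arithmetic, using the component layout from the table above (one index byte and two coefficients per pair, plus the 2-byte scale/norm field):

```python
def spectral_bytes_per_token(F, coeff_bits, head_dim=128):
    """Bytes per token for the layout above, plus compression vs. FP16."""
    index = F * 1                            # 1-byte frequency index per pair
    coeffs = 2 * F * coeff_bits / 8          # cosine + sine coefficients
    overhead = 2                             # scale + norm field
    total = index + coeffs + overhead
    return total, (head_dim * 2) / total     # (bytes, ratio vs. 256-byte FP16)

bytes_tq, ratio = spectral_bytes_per_token(12, 4)   # 26.0 bytes, ~9.85x
```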
The Compression Ladder
| Method | Bytes/token | Compression | Quality | Calibration |
|---|---|---|---|---|
| FP16 (baseline) | 256 | 1.0x | Exact | None |
| FP8 (vLLM default) | 128 | 2.0x | Near-exact | None |
| TQ4 (TurboQuant) | 68 | 3.8x | Good | None |
| TQ35 (hybrid) | 64 | 4.0x | Better | None |
| TQ25 (aggressive) | 44 | 5.8x | Acceptable | None |
| TriAttention (F=16, int8) | 50 | 5.1x | High | Per-model |
| TriAttention (F=12, int4) | 26 | 9.85x | Good | Per-model |
| TriAttention (F=8, int4) | 18 | 14.2x | Acceptable | Per-model |
TriAttention trades TurboQuant's data-oblivious simplicity for higher compression. The tradeoff: TriAttention requires per-model calibration profiles (which frequency pairs matter, how many to keep). TurboQuant works on any model with no calibration. Both ship in the same package.
Qwen3.5 Calibration Profiles
TriAttention ships with pre-computed spectral profiles for the Qwen3.5 model family -- the same models that power our orchestrator and reasoning pipeline:
| Model | Layers | KV Heads | Default F | Compression |
|---|---|---|---|---|
| Qwen3.5-0.6B | 28 | 2 | 10 | 11.6x |
| Qwen3.5-1.7B | 28 | 4 | 12 | 9.85x |
| Qwen3.5-4B | 36 | 8 | 12 | 9.85x |
| Qwen3.5-8B | 36 | 8 | 14 | 8.5x |
| Qwen3.5-14B | 40 | 8 | 14 | 8.5x |
| Qwen3.5-32B | 64 | 8 | 16 | 7.5x |
| Qwen3.5-30B-A3B | 48 | 4 | 12 | 9.85x |
Each profile includes per-layer frequency schedules -- early layers (which capture broad semantic patterns) keep fewer frequencies, while deep layers (which refine token-level predictions) keep more. The profiles are generated by calibration.py and require no user data -- they are computed from the model's RoPE configuration and architectural constants.
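The shipped schedules are produced by calibration.py from each model's RoPE configuration and architectural constants; as an illustration of the shape described above (not the actual numbers), a schedule could ramp F from early to deep layers like this:

```python
import numpy as np

def layer_frequency_schedule(num_layers, f_early=10, f_late=16):
    """Illustrative per-layer schedule: early layers keep fewer frequency
    pairs, deep layers keep more. A linear ramp is only one plausible
    shape -- the real profiles are model-specific."""
    return np.linspace(f_early, f_late, num_layers).round().astype(int)

schedule = layer_frequency_schedule(36)   # 36 layers, e.g. Qwen3.5-4B
```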
Scoring: Reconstructing Attention from Coefficients
At decode time, TriAttention reconstructs approximate attention scores directly from the stored trig coefficients -- no decompression to full vectors needed. The scorer evaluates:
s_hat(m, n) = sum_{f in top-F} a_hat_f * cos(theta_f * delta) + b_hat_f * sin(theta_f * delta)
where delta = m - n is the position difference, and a_hat_f, b_hat_f are the dequantized coefficients. This is O(F) per query-key pair instead of O(d) -- a 5-10x reduction in FLOPs for the score computation itself.
The scorer supports two modes:
- Pre-RoPE (default): Computes scores analytically from trig coefficients with RoPE phases folded in. No vector reconstruction.
- Post-RoPE: Reconstructs approximate key vectors and uses standard dot-product attention. Higher quality, higher cost. Useful as a reference for validation.
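A minimal NumPy sketch of the pre-RoPE mode, evaluating the series above directly from stored coefficients (shapes and names are illustrative; the packaged scorer is batched over heads):

```python
import numpy as np

def trig_series_scores(freq_idx, a_hat, b_hat, theta, q_pos, k_pos):
    """Approximate scores for one query against S cached keys.

    freq_idx : (S, F) int   kept frequency indices per cached token
    a_hat    : (S, F) float dequantized cosine coefficients
    b_hat    : (S, F) float dequantized sine coefficients
    theta    : (d/2,) float full RoPE frequency table
    q_pos    : int          query position m
    k_pos    : (S,) int     key positions n
    O(F) work per query-key pair, independent of head_dim.
    """
    delta = q_pos - k_pos                    # (S,) position differences
    ang = theta[freq_idx] * delta[:, None]   # (S, F) theta_f * delta
    return (a_hat * np.cos(ang) + b_hat * np.sin(ang)).sum(axis=-1)
```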
The Architecture
```
New token arrives
  -> SpectralKVEncoder.encode():
       1. Compute RoPE frequency energies for this token's K vector
       2. Select top-F pairs by energy
       3. Compute trig coefficients (cos/sin projections)
       4. Quantize coefficients to int4/int8/int16
       5. Store SpectralEncoding in SpectralKVCache (paged blocks)

Decode attention
  -> TrigSeriesScorer.score():
       1. Load stored coefficients for cached tokens
       2. Evaluate trig series at position difference delta = q_pos - k_pos
       3. Return approximate attention scores [B, H, 1, S]
  -> Accumulate values via scatter/gather
  -> Return output
```
The entire pipeline -- encode, cache, score, accumulate -- is pure PyTorch. No Triton kernels yet (that is the next step). The reference implementation validates correctness; the fused GPU path will follow the same progression as TurboQuant: CPU reference -> Triton encode -> fused decode kernel -> CUDA graph capture.
Using TriAttention
```python
from aither_kvcache.triattention import TriAttention, TriAttentionConfig

# Default config: F=12, int4, head_dim=128 (Qwen3.5)
config = TriAttentionConfig()
tri = TriAttention(config, num_layers=36, max_seq_len=32768, device="cuda")

# Encode a KV pair
tri.encode_kv(layer_idx=0, key=k, value=v, position=pos)

# Decode attention
output = tri.forward(layer_idx=0, query=q, query_pos=pos)

# Or use a pre-calibrated Qwen3.5 profile:
from aither_kvcache.triattention.calibration import get_config_for_model
config = get_config_for_model("Qwen3.5-8B")
```
What Ships in v2.0.0
The aither-kvcache package is now at v2.0.0 on PyPI:
```shell
pip install "aither-kvcache>=2.0.0"
```
(Quote the requirement specifier -- an unquoted `>` is a shell redirect.)
The package now contains two complementary compression engines:
| Engine | Approach | Compression | Calibration | GPU kernels |
|---|---|---|---|---|
| TurboQuant | Vector quantization (codebook + rotation) | 3.8-5.8x | None (data-oblivious) | Fused Triton (production) |
| TriAttention | Spectral decomposition (trig series) | 7.5-14.2x | Per-model profiles | Reference PyTorch (v0.1) |
Both engines share the same paged block cache infrastructure and vLLM plugin system. TurboQuant remains the recommended engine for production decode throughput (589 tok/s at 20 concurrent). TriAttention is the capacity play -- when you need maximum tokens per byte and can tolerate the reference scorer's throughput while the fused kernels are built.
The full module inventory for v2.0.0:
- aither_kvcache.triattention.config -- TriAttentionConfig with compression ratio calculations
- aither_kvcache.triattention.spectral -- RoPE frequency analysis, pair energies, top-k selection, DCT
- aither_kvcache.triattention.encoder -- SpectralKVEncoder (4/8/16-bit coefficient quantization)
- aither_kvcache.triattention.scorer -- TrigSeriesScorer (pre-RoPE analytic, post-RoPE reference)
- aither_kvcache.triattention.cache -- SpectralKVCache (paged block storage, vLLM-compatible)
- aither_kvcache.triattention.attention -- TriAttention (full encode -> score -> accumulate pipeline)
- aither_kvcache.triattention.calibration -- Qwen3.5 family profiles (0.6B through 32B)
184 tests passing (152 TurboQuant + 32 TriAttention). The TriAttention monorepo test suite runs 69 tests covering spectral decomposition, encode/decode roundtrips, scorer accuracy, cache operations, calibration profiles, and full forward pass validation.
The Roadmap
TriAttention v0.1 is the mathematical foundation. The path forward mirrors TurboQuant's progression:
| Milestone | Target | What it unlocks |
|---|---|---|
| Fused Triton scorer | v2.1 | GPU-native trig series evaluation, 10x+ scorer speedup |
| Spectral cache integration with TieredKVCache | v2.2 | 3-tier spectral: VRAM + DDR5 + recompute |
| Hybrid TQ + TriAttention | v2.3 | TQ for recent tokens (fast decode), spectral for cold tokens (max compression) |
| CUDA graph capture for spectral decode | v2.4 | Full compiled pipeline, concurrent throughput |
The hybrid approach is the endgame: TurboQuant handles the hot working set (recent tokens, fast fused decode at 589 tok/s), while TriAttention compresses the cold tail (older context, 9.85x compression). The working set is small and speed-critical. The tail is massive and compression-critical. Two engines, one cache, each doing what it is best at.
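Expressed as code, that hybrid split reduces to a per-block age check. The names and the 4096-token window below are hypothetical placeholders for what v2.3 might expose:

```python
def choose_engine(token_age, hot_window=4096):
    """Toy hybrid policy: recent tokens stay on the fast TurboQuant decode
    path; older tokens migrate to TriAttention's spectral compression.
    hot_window is a hypothetical tuning knob, not a shipped default."""
    return "turboquant" if token_age < hot_window else "triattention"
```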
The Progression
| Version | Date | Milestone | Key metric |
|---|---|---|---|
| v0.1 | Mar 20 | Core library + Triton kernels | MSE within 2.7x bound |
| v0.5 | Mar 25 | TieredKVCache + vLLM integration | 280K tokens, 26 tok/s |
| v0.7 | Mar 27 | Graphable fused decode | 87.9 tok/s (5 concurrent) |
| v0.8 | Mar 28 | Split-k + rotation cache + buffer reuse | 224 tok/s (5 concurrent) |
| v0.8.1 | Mar 31 | TQ35-PRIMARY hybrid mode | 329K tokens |
| v0.9.1 | Apr 2 | Graph break massacre | 40 tok/s single (was 11) |
| v1.0 | Apr 3 | torch.library.custom_op | Zero graph breaks |
| v1.3 | Apr 5 | Graph eviction + tq-t4nc | 589 tok/s (20 concurrent) |
| v2.0 | Apr 6 | TriAttention spectral compression | 9.85x compression, 184 tests |
aither-kvcache v2.0.0 is live on PyPI. pip install aither-kvcache and you get both engines. 184 tests passing. CC-BY-4.0.
309K tokens hot on TurboQuant. 9.85x compression on TriAttention. The ceiling keeps moving.