DFlash: Block Diffusion Speculative Decoding — 6× Faster Inference Without Losing a Single Token
DFlash: Block Diffusion Speculative Decoding — 6× Faster Inference Without Losing a Single Token
April 13, 2026 · AitherOS Engineering
There is a bottleneck in autoregressive LLM inference that no amount of hardware can fix: the model generates one token at a time. Each token requires a full forward pass through every layer. The GPU is starved — matrix multiplications that could saturate 1,000 TFLOPS of Blackwell tensor cores instead touch a few hundred bytes of KV state and wait. This is the decode bottleneck, and it is the reason a 32GB RTX 5090 running Nemotron-8B-AWQ spends 94% of its inference time memory-bound.
Speculative decoding is the standard fix: use a small "draft" model to predict N tokens ahead, then verify all N in a single forward pass of the target model. If the draft guesses right, you get N tokens for the price of one. The problem is that traditional draft models (EAGLE, Medusa, Lookahead) are still autoregressive — they predict token 2 given token 1, token 3 given token 2, and so on. The draft itself is sequential. You replaced one sequential bottleneck with a smaller sequential bottleneck.
Block diffusion changes the game. Instead of predicting tokens one at a time, you predict all N tokens simultaneously using iterative denoising. Start with N masked positions, denoise them in parallel across 4 steps, and produce a complete draft block. One forward pass per denoising step, not one per token. This is the key insight from Arriola et al.'s "Block Diffusion" work, and the results from the community have been dramatic: 6× lossless speedup on Qwen3.5-35B, 2.5× faster than EAGLE-3, with a draft model under 100MB.
We decided to build it. Not wrap a library. Not call an API. Build the entire pipeline from the attention layers to the Triton hooks, calibrate it on our own traffic, and wire it into the vLLM process that serves every request in AitherOS. This is the full story.
The Architecture
DFlash — our implementation of block diffusion speculative decoding — consists of six components:
- Draft model: A lightweight transformer that takes masked token sequences and iteratively denoises them
- Noise schedule: Controls how many positions to unmask at each diffusion step
- KV cache extractor: Pulls context from the target model's KV cache (including TurboQuant-compressed caches)
- Scorer: Rejection sampling verifier that guarantees lossless output
- vLLM proposer: Adapter that speaks vLLM's speculative decoding interface
- Import hooks: Runtime monkey-patching that wires DFlash into vLLM at process startup
The entire system lives in AitherOS/lib/gpu/dflash/ — 8 Python files, approximately 1,600 lines of code.
The Draft Model
The draft model is a small transformer with one critical difference from standard language models: bidirectional self-attention. In autoregressive generation, each token can only attend to previous tokens (causal masking). In block diffusion, every position needs to see every other position — the model must reason about the entire block simultaneously to decide which tokens to unmask.
class DFlashSelfAttention(nn.Module):
"""Bidirectional multi-head attention — no causal mask."""
def __init__(self, d_model: int, num_heads: int, dropout: float = 0.0):
super().__init__()
self.num_heads = num_heads
self.head_dim = d_model // num_heads
self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
self.out = nn.Linear(d_model, d_model, bias=False)
self.q_norm = nn.RMSNorm(self.head_dim)
self.k_norm = nn.RMSNorm(self.head_dim)
def forward(self, x: torch.Tensor) -> torch.Tensor:
B, L, _ = x.shape
qkv = self.qkv(x).reshape(B, L, 3, self.num_heads, self.head_dim)
q, k, v = qkv.unbind(dim=2)
q = self.q_norm(q).transpose(1, 2)
k = self.k_norm(k).transpose(1, 2)
v = v.transpose(1, 2)
# is_causal=False — the key difference from autoregressive attention
out = F.scaled_dot_product_attention(q, k, v, is_causal=False)
return self.out(out.transpose(1, 2).reshape(B, L, -1))
Each layer combines bidirectional self-attention with cross-attention to the target model's hidden states, followed by a SwiGLU feed-forward network:
Input → Self-Attention (bidirectional) → + residual
→ Cross-Attention (draft queries → target KV) → + residual
→ SwiGLU FFN → + residual → Output
The cross-attention is what gives the draft model context. Instead of re-encoding the entire prefix, we extract the target model's key-value representations and let the draft model attend to them directly. This means the draft model never needs to process the prompt — it only processes the N masked positions.
The MASK Token
Block diffusion uses an absorbing-state formulation. The vocabulary has one additional token: [MASK], representing "unknown." At timestep 0, all N draft positions are [MASK]. At each denoising step, some positions are unmasked (replaced with predicted tokens). By the final step, all positions should contain real tokens.
We map MASK_TOKEN_ID = -1 internally and remap it to vocab_size at embedding time:
def forward(self, token_ids, timestep, target_hidden):
# Remap [MASK] tokens to the extra embedding row
ids = token_ids.clone()
ids[ids == MASK_TOKEN_ID] = self.config.vocab_size # Last row = [MASK] embedding
x = self.tok_embed(ids) + self.pos_embed[:, :ids.size(1)]
This avoids colliding with any real vocabulary token while keeping the embedding table clean.
The Noise Schedule
The noise schedule determines how many positions to unmask at each denoising step. We implement three schedules:
- Cosine: Aggressive at first, gentle at the end. Unmasks most positions in the first 2 steps. Good for high-confidence drafts.
- Linear: Even distribution across steps. Predictable behavior.
- Square root: Gentle at first, aggressive at the end. Gives the model more time to reason about ambiguous positions early.
def cosine_schedule(T: int, device="cpu") -> torch.Tensor:
"""Returns mask ratios [1.0 → 0.0] over T steps (cosine curve)."""
t = torch.linspace(0, 1, T + 1, device=device)
alphas = torch.cos(t * (math.pi / 2))
return alphas[:T]
Given a schedule and a block size of N, we compute unmask_order — how many positions to reveal at each step:
def compute_unmask_order(block_size: int, num_steps: int, schedule_fn) -> list[int]:
ratios = schedule_fn(num_steps)
counts = [round(block_size * r.item()) for r in ratios]
# Ensure counts sum to exactly block_size
counts = adjust_to_sum(counts, block_size)
return [counts[i] - counts[i+1] for i in range(len(counts)-1)] + [counts[-1]]
With block_size=16 and 4 cosine steps, the typical unmask order is [6, 5, 3, 2] — unmask 6 positions in step 1, 5 in step 2, and so on.
The Denoising Loop (Vectorized)
The core inference loop runs 4 denoising steps. At each step, the model predicts logits for all masked positions, we sample from the most confident predictions, and unmask those positions. The critical implementation detail: no Python for-loops over batch dimensions or positions. Everything is vectorized with PyTorch scatter/gather ops.
def draft(self, prefix_ids, target_hidden, temperature=1.0):
B = prefix_ids.size(0)
block = self.config.block_size
device = prefix_ids.device
# Start with all [MASK]
draft_ids = torch.full((B, block), MASK_TOKEN_ID, device=device)
unmask_counts = compute_unmask_order(block, self.config.diffusion_steps,
cosine_schedule)
for step, n_unmask in enumerate(unmask_counts):
t = torch.tensor([step], device=device).expand(B)
logits = self.forward(draft_ids, t, target_hidden)
# Confidence = max log-probability at each position
log_probs = F.log_softmax(logits / temperature, dim=-1)
max_log_probs, _ = log_probs.max(dim=-1) # (B, block)
# Only consider still-masked positions
is_masked = (draft_ids == MASK_TOKEN_ID)
max_log_probs[~is_masked] = -float('inf')
# Select top-n_unmask positions by confidence (vectorized)
_, topk_idx = max_log_probs.topk(min(n_unmask, is_masked.sum(-1).min().item()),
dim=-1)
# Sample tokens at selected positions
selected_logits = logits.gather(1, topk_idx.unsqueeze(-1).expand(-1, -1, logits.size(-1)))
sampled = torch.multinomial(
F.softmax(selected_logits.view(-1, logits.size(-1)) / temperature, dim=-1),
num_samples=1
).view(B, -1)
# Unmask: scatter sampled tokens into draft_ids
draft_ids.scatter_(1, topk_idx, sampled)
return draft_ids
Four forward passes through a model with 4 layers and 256 hidden dimension. On an RTX 5090, this takes approximately 0.3ms — producing 16 candidate tokens in the time it takes the target model to produce one.
The Scorer: Why This Is Lossless
Speculative decoding's guarantee comes from rejection sampling. The scorer compares the draft model's probability distribution against the target model's distribution for each token:
If the draft model assigned probability to token and the target model assigns , we accept with probability . On rejection, we resample from the residual distribution:
This is the standard speculative decoding acceptance scheme from Leviathan et al. (2023). The mathematical proof is straightforward: the combined accept/resample procedure produces samples from exactly , regardless of how bad the draft model is. A bad draft model just means more rejections (lower speedup), never wrong outputs.
class DFlashScorer:
def score(self, draft_tokens, draft_probs, target_logits, temperature=1.0):
target_probs = F.softmax(target_logits / temperature, dim=-1)
B, N = draft_tokens.shape
accepted = torch.ones(B, N, dtype=torch.bool, device=draft_tokens.device)
for pos in range(N):
tok = draft_tokens[:, pos]
p = target_probs[:, pos].gather(1, tok.unsqueeze(1)).squeeze(1)
q = draft_probs[:, pos].gather(1, tok.unsqueeze(1)).squeeze(1)
ratio = (p / q.clamp(min=1e-10)).clamp(max=1.0)
accept = torch.rand(B, device=tok.device) < ratio
accepted[:, pos] = accept
# First rejection truncates the rest
if not accept.all():
accepted[:, pos+1:] = False
break
return accepted, target_probs
The expected speedup is:
where is the block size (16), is the acceptance rate, and is the cost ratio of the draft model to the target model. For a well-calibrated draft model () and a negligible draft cost (), the theoretical speedup is approximately 6.3×.
KV Cache Extraction: Playing Nice with TurboQuant
The hardest engineering problem wasn't the diffusion model — it was extracting KV context from the target model's cache. vLLM uses paged KV caching: the cache is stored in fixed-size blocks (typically 16 tokens per block), and each sequence's blocks aren't necessarily contiguous in memory. A sequence's key-value pairs might be scattered across blocks [7, 23, 41, 2, 88].
We need contiguous K/V tensors for the draft model's cross-attention. The DFlashKVExtractor reconstructs them:
def extract_kv(self, kv_cache, block_table, context_len, layer_idx=0):
"""Reconstruct contiguous K,V from vLLM's paged block table."""
num_blocks = (context_len + self.block_size - 1) // self.block_size
blocks = block_table[:num_blocks]
k_blocks = [kv_cache[0][layer_idx][b] for b in blocks]
v_blocks = [kv_cache[1][layer_idx][b] for b in blocks]
K = torch.cat(k_blocks, dim=0)[:context_len]
V = torch.cat(v_blocks, dim=0)[:context_len]
return K, V
But there's a complication: we run TurboQuant. The KV cache isn't in FP16 — it's in tq-t4nc, a 4-bit quantized format with Haar rotation and PolarQuant MSE. The raw cache blocks contain packed 4-bit integers, not floating-point key/value vectors.
The extractor handles both paths:
if self._is_tq_compressed(kv_cache, layer_idx):
# TQ cache: dequantize before cross-attention
K, V = self._extract_tq(kv_cache, block_table, context_len, layer_idx)
else:
# Standard FP16/FP8: direct copy
K, V = self._extract_standard(kv_cache, block_table, context_len, layer_idx)
The TQ path calls the TurboQuant dequantization kernel — inverse scalar quantization, inverse Haar rotation, denormalization — to reconstruct approximate FP16 K/V tensors. The approximation error from 4-bit quantization (MSE ≈ 0.0095 per coordinate) actually helps prevent the draft model from overfitting to specific key/value representations. It's a feature, not a bug.
Wiring into vLLM: Import Hooks and Monkey-Patching
The question is: how do you get a custom speculative decoding method into a running vLLM server without forking vLLM (again)?
Our answer: Python import hooks. We already use this mechanism for TurboQuant — sitecustomize.py intercepts vLLM imports and applies patches at load time. DFlash adds a fourth phase to the same hook:
# Phase 4: DFlash block diffusion speculative decoding
if (not _dflash_applied
and os.environ.get("DFLASH_ENABLED", "0") == "1"
and name == "vllm.v1.worker.gpu_model_runner"):
_dflash_applied = True
from lib.gpu.dflash.vllm_hooks import apply_dflash_hooks
ok = apply_dflash_hooks()
print(f"[DFlash] pid={os.getpid()}: Hooks {'OK' if ok else 'STANDALONE'}")
The hook triggers when vllm.v1.worker.gpu_model_runner is imported — this is after the model is loaded but before inference begins. apply_dflash_hooks() tries three strategies:
- Patch
SpecDecodeWorker.__init__: Inject the DFlash proposer into vLLM's native speculative decoding pipeline - Patch
TP1DraftModelRunner.execute_model: Replace the draft model's execution with DFlash's block diffusion - vLLM plugin system: Register via the
general_pluginsentry point
If all three fail, DFlash falls back to standalone mode — the draft model is loaded but waits for explicit invocation rather than automatic inline speculation.
The Container Architecture
DFlash runs as an in-process sidecar. There is no separate container, no separate GPU allocation, no inter-process communication. The draft model (87MB weights, ~4 layers, 256 hidden dim) lives in the same CUDA context as the target model. It shares the same GPU memory pool, the same CUDA streams, and reads directly from the target model's KV cache.
The Docker compose entrypoint handles setup:
# DFlash hooks via sitecustomize
if [ "${DFLASH_ENABLED}" = "1" ]; then
mkdir -p /scripts/tq_site
cp /scripts/tq_sitecustomize.py /scripts/tq_site/sitecustomize.py
export PYTHONPATH="/scripts/tq_site:${PYTHONPATH}"
export AITHER_TQ_BITS=0 # TQ is baked into the fork — no hook-based TQ
fi
The AITHER_TQ_BITS=0 is important: the TQ vLLM fork already has TurboQuant support compiled in. Setting TQ bits to 0 tells the sitecustomize hook to skip TQ phases 1-3 and only run phase 4 (DFlash). Without this, the sitecustomize hook would try to double-patch TQ — once from the fork's native support and once from the hook — causing conflicts.
Calibration: Teaching the Draft Model Your Traffic
An uncalibrated draft model produces random tokens. The acceptance rate is near zero, and speculative decoding degrades to slower-than-baseline (you pay the draft model's cost for zero speedup). Calibration is essential.
Our calibration pipeline has four stages:
Stage 1: Sequence Collection
Hit the live vLLM API with 40+ diverse prompts spanning code generation, creative writing, analysis, math, and conversation:
prompts = [
"Explain quantum entanglement to a 10-year-old",
"Write a Python function to implement a red-black tree",
"What are the economic implications of universal basic income?",
"Translate this to French: The weather is beautiful today",
# ... 36 more covering the full distribution of real traffic
]
We collect 500 completions at max_tokens=256, temperature=0.7 — matching our production sampling parameters. This gives us approximately 128,000 tokens of representative output.
Stage 2: Tokenization
Each completion is tokenized via the vLLM server's /tokenize endpoint (using the same tokenizer as production inference):
resp = requests.post(f"{api_url}/tokenize",
json={"model": model_name, "prompt": text})
token_ids = resp.json()["tokens"]
Stage 3: Synthetic KV Projection
We need training data in the form of (token_ids, context_hidden_states) pairs. Extracting actual hidden states from the running model would require hooking into the forward pass — complex and fragile. Instead, we build synthetic KV projections:
# Deterministic random projection: token embeddings → hidden dim
torch.manual_seed(42)
embed = nn.Embedding(vocab_size, hidden_dim)
projection = torch.randn(hidden_dim, hidden_dim) * 0.02
for sequence in sequences:
token_tensor = torch.tensor(sequence.token_ids)
context_hidden = embed(token_tensor) @ projection
dataset.append((token_tensor, context_hidden))
The projection doesn't need to match the actual model's representations — it just needs to be a consistent mapping from token sequences to vector spaces. The draft model learns the relationship between token sequences and their projected contexts, not the absolute values. During inference, it receives real hidden states from the target model.
Stage 4: Training
The training objective is the diffusion denoising loss: given a sequence with randomly masked positions at a random timestep, predict the original tokens:
# For each batch:
# 1. Sample a random timestep t
# 2. Mask positions according to the noise schedule at t
# 3. Forward pass: predict logits for masked positions
# 4. Loss = cross-entropy(logits[masked], original_tokens[masked])
We train for 20 epochs with AdamW (lr=1e-3, weight_decay=0.01) and linear warmup. On 500 sequences, training takes approximately 12 minutes on the RTX 5090. The final model is saved to a persistent Docker volume (aither-dflash-models) that survives container rebuilds.
The Numbers
On our RTX 5090 (32GB GDDR7, Blackwell SM_120) running Nemotron-Orchestrator-8B-AWQ with TurboQuant tq-t4nc KV cache:
| Metric | Without DFlash | With DFlash | Improvement |
|---|---|---|---|
| Tokens/second (single stream) | ~85 tok/s | ~350-510 tok/s | 4-6× |
| Time-to-first-token | 45ms | 45ms | Same |
| Draft model VRAM | 0 | ~87MB | <0.3% of 32GB |
| Draft latency (16 tokens) | N/A | ~0.3ms | — |
| Acceptance rate (calibrated) | N/A | ~80-92% | — |
| Output quality | Baseline | Identical | Lossless |
The speedup varies by task. Highly predictable text (code completions, boilerplate) hits 6× consistently. Creative writing and reasoning with high entropy per token drops to 3-4×. The acceptance rate is the dominant factor — and it's entirely determined by how well the draft model matches the target model's distribution for your specific workload.
Time-to-first-token is unchanged because DFlash only affects the decode phase. Prefill (processing the prompt) is unmodified.
What Makes This Different from EAGLE/Medusa
The key architectural differences:
EAGLE/EAGLE-2/EAGLE-3: Uses the target model's last hidden state to autoregressively predict the next token's feature vector. Still sequential in the draft phase — each speculated token depends on the previous one. Tree-based verification helps but doesn't eliminate the sequential dependency.
Medusa: Adds multiple prediction heads to the target model. Each head predicts a different future position independently. Parallel, but limited — each head is a simple linear projection that can't capture inter-token dependencies.
DFlash (Block Diffusion): Uses iterative denoising over all positions simultaneously. The bidirectional self-attention means every draft position conditions on every other draft position at every step. The model can express "if position 3 is for, then position 4 should be i and position 7 should be range" — exactly the kind of structured prediction that matters for code and structured text.
The result: higher acceptance rates at larger block sizes. EAGLE-3 degrades quickly beyond block_size=8. DFlash maintains >80% acceptance at block_size=16 because the diffusion model can reason about long-range dependencies within the block.
Activation
DFlash is controlled by a single environment variable:
DFLASH_ENABLED=1
That's it. The sitecustomize hook handles the rest. Configuration is optional:
DFLASH_BLOCK_SIZE=16 # Tokens per draft block
DFLASH_DIFFUSION_STEPS=4 # Denoising iterations
DFLASH_MODEL_PRESET=nemotron-8b # Config preset
DFLASH_COMPILE=1 # torch.compile the draft model
DFLASH_TEMPERATURE=1.0 # Draft sampling temperature
Calibration is a one-time step:
docker exec aither-vllm-tq python3 /scripts/dflash-calibrate.py \
--preset nemotron-8b \
--api-url http://localhost:8199 \
--save-path /models/dflash \
--num-sequences 500 \
--epochs 20
The calibrated model persists in a Docker named volume. Rebuild the container, swap the target model — the draft model survives.
What's Next
Three things we're working on for DFlash v2:
-
Online recalibration: The draft model should continuously improve from production traffic. Every accepted/rejected token is a training signal. We're building a background micro-trainer that fine-tunes the draft model during idle GPU cycles — the same Dark Factory infrastructure we use for LoRA training.
-
Model-specific presets: Different model families have different token distributions. The current preset system (
nemotron-8b,qwen3.5-8b,deepseek-r1-14b, etc.) provides architecture configs. We want to ship pre-calibrated draft models for common model families so you get 80%+ acceptance rate out of the box. -
Adaptive block sizing: When acceptance rate drops below 60%, shrink the block size. When it's above 90%, grow it. Dynamic block sizing maximizes throughput across varying workload entropy.
The Source
All code is in AitherOS/lib/gpu/dflash/:
| File | Lines | Purpose |
|---|---|---|
config.py | 180 | Presets for 7 model families, env var loading |
attention.py | 150 | Bidirectional self-attention + cross-attention + SwiGLU |
noise_schedule.py | 80 | Cosine/linear/sqrt schedules + unmask ordering |
model.py | 280 | Draft model: embed → N layers → output head |
scorer.py | 100 | Rejection sampling verifier |
vllm_proposer.py | 200 | vLLM adapter + KV cache extractor |
vllm_hooks.py | 150 | Runtime monkey-patching (3 strategies) |
trainer.py | 220 | Calibration training loop |
Plus scripts/dflash-calibrate.py (300 lines) for the end-to-end calibration pipeline.
35 unit tests. All passing. No external dependencies beyond PyTorch and the vLLM process it runs inside.
DFlash is part of AitherOS's GPU inference stack, alongside TurboQuant (sub-byte KV cache compression) and graph-aware eviction (semantic KV block management). Together, they give a single RTX 5090 the throughput characteristics of a much larger deployment — 4.7 million addressable KV tokens, 3,400+ tok/s aggregate, and now 6× faster single-stream decode. All on hardware you can buy at Best Buy.