Gemma 4 Hot-Swap: How We Got Google's MoE Reasoning Model Running on a Single GPU
Google released Gemma 4 26B-A4B in April 2026: a Mixture-of-Experts reasoning model with 26 billion parameters but only 4 billion active per token. On paper, it's the perfect challenger to our existing DeepSeek-R1:14b local reasoning model — comparable parameter budget at inference time, native <|think|> tokens for chain-of-thought, and Apache 2.0 licensed.
We already had the hot-swap infrastructure: two vLLM containers sharing a network alias on port 8176, one docker compose stop and docker compose up away from switching reasoning backends. Zero downtime for the orchestrator. The architecture paid for itself before we wrote a single line of Gemma-specific code.
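The shared-alias pattern is the whole trick, so it's worth sketching. This is an illustrative compose fragment, not our exact file — service names, the network name, and the profile are assumptions; only the alias and image come from our setup:

```yaml
# Both reasoning containers claim the same network alias. The orchestrator
# always dials aither-vllm-reasoning:8176 and never learns which backend
# is actually running. Service/network names here are illustrative.
services:
  aither-vllm-deepseek:
    image: aither-vllm-tq:latest
    networks:
      inference:
        aliases: [aither-vllm-reasoning]
  aither-vllm-gemma4:
    image: aither-vllm-tq:latest
    profiles: [gemma4]              # not started by default
    networks:
      inference:
        aliases: [aither-vllm-reasoning]   # same alias: swap is invisible
networks:
  inference: {}
```

Stopping one service and starting the other re-points the alias with no orchestrator config change.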
Getting it to actually start was a different story.
Bug #1: The Transformers Pin
vLLM 0.19 ships with Gemma 4 model code. The model architecture class Gemma4ForConditionalGeneration is right there in the source tree. So you'd think --model cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit would Just Work.
It does not. vLLM 0.19 pins transformers<5. Gemma 4's architecture needs transformers>=5.5.0. The model class exists in vLLM, but the HuggingFace config loader can't parse the model's config.json without the newer transformers library.
First attempt: pip install --no-deps 'transformers>=5.5.0'. This gets past the version pin, but crashes immediately:
ImportError: cannot import name 'is_offline_mode' from 'huggingface_hub'
Transformers 5.x imports newer symbols from huggingface_hub that don't exist in the version vLLM ships with. The --no-deps flag skipped the hub upgrade.
Fix: drop --no-deps, install both packages:
pip install 'transformers>=5.5.0' 'huggingface_hub>=0.32'
This is tracked upstream in vllm-project/vllm#39216. Until vLLM drops the transformers pin (likely 0.20), you need this pip dance at container startup.
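In practice the pip dance lives in the container entrypoint so the image itself stays stock. A minimal sketch, assuming vLLM's standard `vllm serve` CLI; the `MODEL` variable and script layout are ours, not anything vLLM ships:

```shell
#!/usr/bin/env sh
# Upgrade BOTH packages together: transformers 5.x imports symbols that
# only exist in newer huggingface_hub, so --no-deps is a trap.
pip install 'transformers>=5.5.0' 'huggingface_hub>=0.32'

# Then hand off to vLLM as usual (MODEL is set by the compose file).
exec vllm serve "$MODEL" "$@"
```

Once vLLM relaxes its transformers pin, this entrypoint can be deleted without touching anything else.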
Bug #2: The Quantization Lie
The model card for cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit says AWQ. The filename says AWQ. The HuggingFace tags say AWQ. So naturally we passed --quantization awq_marlin to get the fast Marlin dequantization kernels.
vLLM disagreed:
ValueError: Quantization method specified in the model config (compressed-tensors)
does not match the quantization method specified in the command line (awq_marlin).
What happened: the cyankiwi checkpoint was quantized with AutoAWQ, which since late 2025 serializes weights in compressed-tensors format — not the raw AWQ format that vLLM's awq_marlin backend expects. The model's config.json correctly says compressed-tensors, but the model card doesn't mention this.
Fix: --quantization compressed-tensors. vLLM automatically selects Marlin W4A16 MoE kernels (CompressedTensorsWNA16MarlinMoEMethod) for the dequantization path. You get the same fast Marlin kernels — just through a different loader.
Lesson: always cat config.json | grep quant on the actual checkpoint. Model cards lie.
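The check is one dictionary lookup once you have the config. A small sketch: HuggingFace quantized checkpoints carry a quantization_config block whose quant_method field is what vLLM validates the --quantization flag against (the sample config below is abbreviated, not the real file):

```python
import json

def quant_method(config: dict) -> str:
    """Return the quantization method recorded in a HuggingFace config dict."""
    # vLLM validates --quantization against this field, not the model card.
    return config.get("quantization_config", {}).get("quant_method", "none")

# Abbreviated stand-in for the checkpoint's config.json:
cfg = json.loads("""
{
  "model_type": "gemma4",
  "quantization_config": { "quant_method": "compressed-tensors" }
}
""")
print(quant_method(cfg))  # compressed-tensors
```

Run this against the downloaded checkpoint before writing your serve flags; if it disagrees with the model card, the config wins.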
Bug #3: The VRAM Surprise
Here's where MoE pricing gets deceptive. "26B params, 4B active" sounds like you're loading a 4B model. Our back-of-envelope estimate: ~6 GB for 4-bit weights, fits comfortably alongside the orchestrator at gpu_memory_utilization=0.40.
Wrong. MoE stores ALL 26 billion parameters in VRAM. The expert routing just activates 4B per forward pass, but all expert weights must be resident for the router to select from them. The actual weight file is 16.48 GB.
At gpu_util=0.40 (13 GB budget on our RTX 5090's 32 GB), the weights don't even fit. vLLM crashes during model loading with an OOM.
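The arithmetic that bit us fits in a throwaway helper. These are naive decimal-GB estimates; real checkpoints run larger (16-bit embeddings, quantization scales, unquantized layers), which is how 26B at 4-bit becomes a 16.48 GB file rather than 13 GB:

```python
def weight_vram_gb(params: float, bits: float) -> float:
    """Naive resident-weight estimate in decimal GB."""
    return params * bits / 8 / 1e9

def vram_budget_gb(total_gb: float, gpu_util: float) -> float:
    """What vLLM may touch at a given gpu_memory_utilization."""
    return total_gb * gpu_util

# MoE keeps ALL 26B params resident; "4B active" is compute, not storage.
naive = weight_vram_gb(26e9, 4)       # 13.0 GB, before scales/embeddings
budget = vram_budget_gb(32, 0.40)     # 12.8 GB at gpu_util=0.40
print(naive, budget, naive > budget)  # weights overflow the budget even naively
```

Even the optimistic 13 GB estimate already exceeds the 0.40 budget, before the real 16.48 GB file enters the picture.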
The fix required replanning the VRAM budget:
| Component | Before | After |
|---|---|---|
| Orchestrator (8B, TQ4) | 0.40 (13 GB) | 0.40 (13 GB) |
| Reasoning model budget | 0.40 (13 GB) | 0.55 (18 GB) |
| Gemma 4 weights | — | 16.48 GB |
| KV cache (fp8, 8K ctx) | — | ~0.8 GB |
| Context length | 32K | 8K |
We dropped max context from 32K to 8K and raised GPU utilization to 0.55. The orchestrator sleeps during Gemma 4's cold start (the entrypoint handles this automatically), then wakes up once the reasoning model is healthy. Final equilibrium: orchestrator (13 GB) + Gemma 4 (16.48 GB weights + ~0.8 GB KV cache) ≈ 30 GB of 32 GB. Tight, but it works.
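For reference, here is the resulting serve invocation with the flags from this post collected in one place. A sketch assuming vLLM's standard CLI spelling, not a copy of our compose file:

```shell
# Final Gemma 4 configuration (flag set assembled from the text above;
# exact spellings may shift between vLLM versions).
vllm serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \
  --quantization compressed-tensors \
  --gpu-memory-utilization 0.55 \
  --max-model-len 8192 \
  --kv-cache-dtype fp8_e4m3 \
  --enforce-eager \
  --enable-auto-tool-choice --tool-call-parser hermes \
  --port 8176
```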
The key insight: MoE "active parameter" counts describe compute cost, not memory cost. Always check the actual safetensors file size.
The Heterogeneous Head Problem
This is the real performance blocker, and it's architectural — not a configuration mistake.
Gemma 4 has two types of attention layers:
- Local attention layers: head_dim=256
- Global attention layers: head_dim=512
Most transformer models use uniform head dimensions across all layers. CUDA graph capture requires static tensor shapes — you record a computation graph once, then replay it without the overhead of Python dispatch. When attention layer dimensions alternate between 256 and 512, the graph capture fails because the tensor shapes change layer-to-layer.
vLLM detects this heterogeneity in Gemma 4's config and auto-forces TRITON_ATTN as the attention backend (FlashInfer can't handle it either). The recommendation is --enforce-eager — no CUDA graphs, no torch.compile.
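The backend decision reduces to a uniformity check over per-layer head dims. A toy sketch of that logic — the layer layout and backend names below are illustrative, not Gemma 4's actual config schema or vLLM's real dispatch code:

```python
def pick_attention_backend(layer_head_dims: list[int]) -> str:
    """Toy backend choice: heterogeneous head dims rule out graph-captured
    kernels, forcing the shape-flexible (eager) Triton path."""
    if len(set(layer_head_dims)) > 1:
        return "TRITON_ATTN"   # shapes change layer-to-layer: stay eager
    return "FLASH_ATTN"        # uniform shapes: graph capture is on the table

# Illustrative Gemma-4-like layout: local (256) and global (512) interleaved.
layers = [256, 256, 256, 512] * 8
print(pick_attention_backend(layers))  # TRITON_ATTN
```

A uniform model like DeepSeek-R1:14b passes the check and keeps the fast path; Gemma 4 never can, which is why this blocker is architectural rather than a flag away from fixed.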
The performance impact is severe:
| Metric | DeepSeek-R1:14b | Gemma 4 26B-A4B |
|---|---|---|
| Generation throughput | ~42 tok/s | ~15.2 tok/s |
| CUDA graphs | piecewise | none |
| torch.compile | enabled | disabled |
| KV cache dtype | tq-t4nc (4-bit) | fp8_e4m3 (8-bit) |
| Speculative decoding | DFlash native | none |
| Attention backend | TRITON_ATTN | TRITON_ATTN |
15.2 tok/s is usable for reasoning (short, bursty requests), but it's a 2.8x throughput penalty compared to DeepSeek running the full TurboQuant pipeline. For interactive chat with reasoning-in-the-loop, that latency adds up.
What's Working
Despite the performance gap, the inference quality is correct and the infrastructure is solid:
- TurboQuant image: Gemma 4 runs on our aither-vllm-tq:latest Docker image. The TQ KV cache plugin is loaded (logs show "Registered TurboQuant CUSTOM backend"), just using fp8_e4m3 instead of the full tq-t4nc pipeline.
- Marlin MoE kernels: CompressedTensorsWNA16MarlinMoEMethod handles weight dequantization. 4-bit weights, 16-bit compute — fast matmuls even without CUDA graphs.
- TRITON_ATTN backend: Correct choice for heterogeneous head dimensions. Handles the 256/512 alternation natively.
- Hot-swap architecture: One docker compose stop aither-vllm-deepseek && docker compose up -d aither-vllm-gemma4 and the reasoning endpoint switches. The orchestrator doesn't care which model is behind port 8176.
- Tool calling: --enable-auto-tool-choice --tool-call-parser hermes gives Gemma 4 structured tool output, same as DeepSeek. Agent dispatch works unchanged.
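Because the endpoint is OpenAI-compatible, the orchestrator's tool-calling requests are unchanged across the swap. A sketch of such a request body — the tool schema and user message are invented for illustration; the alias, port, and parser come from the text above:

```python
import json

# Request the orchestrator would POST to
# http://aither-vllm-reasoning:8176/v1/chat/completions
payload = {
    "model": "cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit",
    "messages": [{"role": "user", "content": "What's on the calendar today?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "list_events",  # hypothetical tool, for illustration
            "description": "List today's calendar events",
            "parameters": {"type": "object", "properties": {}},
        },
    }],
    "tool_choice": "auto",   # let the hermes parser emit structured calls
}
body = json.dumps(payload)
```

Nothing in this payload names the backend, which is exactly why agent dispatch survives the hot swap untouched.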
The Optimization Roadmap
We have three levers to close the throughput gap, in order of confidence:
TQ tq-t4nc KV Cache (shipped in aither-kvcache v2.1.0): Switch --kv-cache-dtype fp8_e4m3 to tq-t4nc. The TurboQuant KV cache is a per-token encode/decode operation that hooks into TritonAttentionImpl.forward(), which Gemma 4 already uses via TRITON_ATTN.
The initial attempt crashed immediately:
NotImplementedError: The page size of the layer is not divisible by the maximum page size.
Cannot unify by adjusting block_size.
Root cause: three layers of uniform head_dim assumptions. The global _tq_quantizer singleton created a single quantizer for head_dim=256 (the first layer it saw), then tried to use it for head_dim=512 global attention layers. Page sizes computed per-layer weren't divisible by each other, failing vLLM's unify_kv_cache_spec_page_size(). Finally, hardcoded head_size // 2 expressions scattered through the encode/decode paths assumed that 4-bit packing always produces head_dim / 2 bytes, regardless of which quantizer instance handles a given layer.
The fix: replace the global singleton with a per-head-dim quantizer dict (_tq_quantizers[head_dim]), add power-of-2 page-size alignment for cross-layer divisibility, and derive packed dimensions from the actual layer's head size everywhere. Backward compatible — uniform models like DeepSeek see a dict with one entry instead of a scalar.
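The shape of that fix, as a sketch (names are ours for illustration, not the actual aither-kvcache source): one quantizer per head dim cached in a dict, and per-layer page sizes rounded up to a power of two so the 256- and 512-dim specs stay mutually divisible:

```python
class TQQuantizer:
    """Stand-in for the per-head-dim quantizer (illustrative, simplified)."""
    def __init__(self, head_dim: int):
        self.head_dim = head_dim
        # Packed size derived from THIS layer's head dim, never from
        # whichever layer happened to initialize a global singleton.
        self.packed_bytes = head_dim // 2   # 4-bit packing: 2 values/byte

_tq_quantizers: dict[int, TQQuantizer] = {}

def get_quantizer(head_dim: int) -> TQQuantizer:
    """One cached quantizer per head dim; uniform models get a single entry."""
    if head_dim not in _tq_quantizers:
        _tq_quantizers[head_dim] = TQQuantizer(head_dim)
    return _tq_quantizers[head_dim]

def align_page_size(nbytes: int) -> int:
    """Round a per-layer page size up to a power of two so pages from
    layers with different head dims divide one another cleanly."""
    size = 1
    while size < nbytes:
        size *= 2
    return size

local_pages = align_page_size(get_quantizer(256).packed_bytes)    # 128
global_pages = align_page_size(get_quantizer(512).packed_bytes)   # 256
assert global_pages % local_pages == 0  # cross-layer divisibility restored
```

For DeepSeek-style uniform models the dict holds one entry and behavior is unchanged, which is what makes the fix backward compatible.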
Piecewise CUDA Graphs (medium confidence): Replace --enforce-eager with --compilation-config {"cudagraph_mode":"piecewise"}. Piecewise graphs capture individual ops rather than the full forward pass, handling graph breaks gracefully. The question is whether vLLM's piecewise capture handles the layer-to-layer shape changes in MoE attention. If it works, this is a ~2x throughput improvement. If it fails, we keep eager mode and still get the TQ KV cache gains.
DFlash Speculative Decoding (high confidence once calibrated): Our DFlash native pipeline is model-architecture-agnostic — it trains a small draft head on the target model's hidden states. The existing dflash-calibrate-native.py script should work when pointed at the Gemma 4 model. Start conservative with 8 speculative tokens (vs 16 for DeepSeek) since MoE's diverse expert outputs may be harder for the draft head to predict.
Target: 30+ tok/s with TQ KV + piecewise graphs, potentially 40+ with DFlash.
Lessons Learned
vLLM's "supported models" page is aspirational, not factual. Gemma 4 is listed as supported. It is not, without manual transformers upgrades that break vLLM's own dependency constraints. Always test in a disposable container before committing to a model swap.
MoE weight sizes are deceptive. "4B active" means 4B compute, not 4B storage. All experts live in VRAM. Budget for the full parameter count, not the active count.
Always read config.json. Model cards say AWQ. Model configs say compressed-tensors. The config is truth; the card is marketing. cat config.json | grep quant before you write your --quantization flag.
Hot-swap architecture is worth building early. Three bugs, three fixes, zero downtime. The shared network alias pattern (aither-vllm-reasoning:8176) meant the orchestrator never knew we were debugging. Build the abstraction before you need it.
Performance isn't just model quality. Gemma 4's reasoning quality is competitive with DeepSeek-R1:14b. But at 15 tok/s vs 42 tok/s, the faster model wins in production even if the slower model is marginally better at reasoning. Throughput is a feature.
This is part of our series on building AitherOS, an AI-powered agent operating system. All inference runs on a single RTX 5090 with 32 GB VRAM.