Gemma 4 Hot-Swap: How We Got Google's MoE Reasoning Model Running on a Single GPU
Google released Gemma 4 26B-A4B in April 2026: a Mixture-of-Experts reasoning model with 26 billion parameters but only 4 billion active per token. On paper, it's the perfect challenger to our existing DeepSeek-R1:14b local reasoning model — comparable parameter budget at inference time, native <|think|> tokens for chain-of-thought, and Apache 2.0 licensed.
We already had the hot-swap infrastructure: two vLLM containers sharing a network alias on port 8176, one docker compose stop and docker compose up away from switching reasoning backends. Zero downtime for the orchestrator. The architecture paid for itself before we wrote a single line of Gemma-specific code.
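The shared-alias pattern is the whole trick, so it's worth sketching. This is an illustrative compose fragment, not our exact file — service names, the network name, and the profile are assumptions; only the alias and image come from our setup:

```yaml
# Both reasoning containers claim the same network alias. The orchestrator
# always dials aither-vllm-reasoning:8176 and never learns which backend
# is actually running. Service/network names here are illustrative.
services:
  aither-vllm-deepseek:
    image: aither-vllm-tq:latest
    networks:
      inference:
        aliases: [aither-vllm-reasoning]
  aither-vllm-gemma4:
    image: aither-vllm-tq:latest
    profiles: [gemma4]              # not started by default
    networks:
      inference:
        aliases: [aither-vllm-reasoning]   # same alias: swap is invisible
networks:
  inference: {}
```

Stopping one service and starting the other re-points the alias with no orchestrator config change.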
Getting it to actually start was a different story.
Bug #1: The Transformers Pin
vLLM 0.19 ships with Gemma 4 model code. The model architecture class Gemma4ForConditionalGeneration is right there in the source tree. So you'd think --model cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit would Just Work.
It does not. vLLM 0.19 pins transformers<5. Gemma 4's architecture needs transformers>=5.5.0. The model class exists in vLLM, but the HuggingFace config loader can't parse the model's config.json without the newer transformers library.
First attempt: pip install --no-deps 'transformers>=5.5.0'. This gets past the version pin, but crashes immediately:
ImportError: cannot import name 'is_offline_mode' from 'huggingface_hub'
Transformers 5.x imports newer symbols from huggingface_hub that don't exist in the version vLLM ships with. The --no-deps flag skipped the hub upgrade.
Fix: drop --no-deps, install both packages:
pip install 'transformers>=5.5.0' 'huggingface_hub>=0.32'
This is tracked upstream in vllm-project/vllm#39216. Until vLLM drops the transformers pin (likely 0.20), you need this pip dance at container startup.
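In practice the pip dance lives in the container entrypoint so the image itself stays stock. A minimal sketch, assuming vLLM's standard `vllm serve` CLI; the `MODEL` variable and script layout are ours, not anything vLLM ships:

```shell
#!/usr/bin/env sh
# Upgrade BOTH packages together: transformers 5.x imports symbols that
# only exist in newer huggingface_hub, so --no-deps is a trap.
pip install 'transformers>=5.5.0' 'huggingface_hub>=0.32'

# Then hand off to vLLM as usual (MODEL is set by the compose file).
exec vllm serve "$MODEL" "$@"
```

Once vLLM relaxes its transformers pin, this entrypoint can be deleted without touching anything else.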
Bug #2: The Quantization Lie
The model card for cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit says AWQ. The filename says AWQ. The HuggingFace tags say AWQ. So naturally we passed --quantization awq_marlin to get the fast Marlin dequantization kernels.
vLLM disagreed:
ValueError: Quantization method specified in the model config (compressed-tensors)
does not match the quantization method specified in the command line (awq_marlin).
What happened: the cyankiwi checkpoint was quantized with AutoAWQ, which since late 2025 serializes weights in compressed-tensors format — not the raw AWQ format that vLLM's awq_marlin backend expects. The model's config.json correctly says compressed-tensors, but the model card doesn't mention this.
Fix: --quantization compressed-tensors. vLLM automatically selects Marlin W4A16 MoE kernels (CompressedTensorsWNA16MarlinMoEMethod) for the dequantization path. You get the same fast Marlin kernels — just through a different loader.
Lesson: always cat config.json | grep quant on the actual checkpoint. Model cards lie.
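The check is one dictionary lookup once you have the config. A small sketch: HuggingFace quantized checkpoints carry a quantization_config block whose quant_method field is what vLLM validates the --quantization flag against (the sample config below is abbreviated, not the real file):

```python
import json

def quant_method(config: dict) -> str:
    """Return the quantization method recorded in a HuggingFace config dict."""
    # vLLM validates --quantization against this field, not the model card.
    return config.get("quantization_config", {}).get("quant_method", "none")

# Abbreviated stand-in for the checkpoint's config.json:
cfg = json.loads("""
{
  "model_type": "gemma4",
  "quantization_config": { "quant_method": "compressed-tensors" }
}
""")
print(quant_method(cfg))  # compressed-tensors
```

Run this against the downloaded checkpoint before writing your serve flags; if it disagrees with the model card, the config wins.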
Bug #3: The VRAM Surprise
Here's where MoE pricing gets deceptive. "26B params, 4B active" sounds like you're loading a 4B model. Our back-of-envelope estimate: ~6 GB for 4-bit weights, fits comfortably alongside the orchestrator at gpu_memory_utilization=0.40.
Wrong. MoE stores ALL 26 billion parameters in VRAM. The expert routing just activates 4B per forward pass, but all expert weights must be resident for the router to select from them. The actual weight file is 16.48 GB.
At gpu_util=0.40 (13 GB budget on our RTX 5090's 32 GB), the weights don't even fit. vLLM crashes during model loading with an OOM.
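The arithmetic that bit us fits in a throwaway helper. These are naive decimal-GB estimates; real checkpoints run larger (16-bit embeddings, quantization scales, unquantized layers), which is how 26B at 4-bit becomes a 16.48 GB file rather than 13 GB:

```python
def weight_vram_gb(params: float, bits: float) -> float:
    """Naive resident-weight estimate in decimal GB."""
    return params * bits / 8 / 1e9

def vram_budget_gb(total_gb: float, gpu_util: float) -> float:
    """What vLLM may touch at a given gpu_memory_utilization."""
    return total_gb * gpu_util

# MoE keeps ALL 26B params resident; "4B active" is compute, not storage.
naive = weight_vram_gb(26e9, 4)       # 13.0 GB, before scales/embeddings
budget = vram_budget_gb(32, 0.40)     # 12.8 GB at gpu_util=0.40
print(naive, budget, naive > budget)  # weights overflow the budget even naively
```

Even the optimistic 13 GB estimate already exceeds the 0.40 budget, before the real 16.48 GB file enters the picture.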
The fix required replanning the VRAM budget:
| Component | Before | After |
|---|---|---|
| Orchestrator (8B, TQ4) | 0.40 (13 GB) | 0.40 (13 GB) |
| Reasoning model budget | 0.40 (13 GB) | 0.55 (18 GB) |
| Gemma 4 weights | — | 16.48 GB |
| KV cache (fp8, 8K ctx) | — | ~0.8 GB |
| Context length | 32K | 8K |
We dropped max context from 32K to 8K and raised GPU utilization to 0.55. The orchestrator sleeps during Gemma 4's cold start (the entrypoint handles this automatically), then wakes up once the reasoning model is healthy. Final equilibrium: orchestrator (13 GB) + Gemma 4 (16.48 GB weights + ~0.8 GB KV cache) ≈ 30 GB of 32 GB. Tight, but it works.
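For reference, here is the resulting serve invocation with the flags from this post collected in one place. A sketch assuming vLLM's standard CLI spelling, not a copy of our compose file:

```shell
# Final Gemma 4 configuration (flag set assembled from the text above;
# exact spellings may shift between vLLM versions).
vllm serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \
  --quantization compressed-tensors \
  --gpu-memory-utilization 0.55 \
  --max-model-len 8192 \
  --kv-cache-dtype fp8_e4m3 \
  --enforce-eager \
  --enable-auto-tool-choice --tool-call-parser hermes \
  --port 8176
```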
The key insight: MoE "active parameter" counts describe compute cost, not memory cost. Always check the actual safetensors file size.
The Heterogeneous Head Problem
This is the real performance blocker, and it's architectural — not a configuration mistake.
Gemma 4 has two types of attention layers:
- Local attention layers: head_dim=256
- Global attention layers: head_dim=512
Most transformer models use uniform head dimensions across all layers. CUDA graph capture requires static tensor shapes — you record a computation graph once, then replay it without the overhead of Python dispatch. When attention layer dimensions alternate between 256 and 512, the graph capture fails because the tensor shapes change layer-to-layer.
vLLM detects this heterogeneity in Gemma 4's config and auto-forces TRITON_ATTN as the attention backend (FlashInfer can't handle it either). The recommendation is --enforce-eager — no CUDA graphs, no torch.compile.
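The backend decision reduces to a uniformity check over per-layer head dims. A toy sketch of that logic — the layer layout and backend names below are illustrative, not Gemma 4's actual config schema or vLLM's real dispatch code:

```python
def pick_attention_backend(layer_head_dims: list[int]) -> str:
    """Toy backend choice: heterogeneous head dims rule out graph-captured
    kernels, forcing the shape-flexible (eager) Triton path."""
    if len(set(layer_head_dims)) > 1:
        return "TRITON_ATTN"   # shapes change layer-to-layer: stay eager
    return "FLASH_ATTN"        # uniform shapes: graph capture is on the table

# Illustrative Gemma-4-like layout: local (256) and global (512) interleaved.
layers = [256, 256, 256, 512] * 8
print(pick_attention_backend(layers))  # TRITON_ATTN
```

A uniform model like DeepSeek-R1:14b passes the check and keeps the fast path; Gemma 4 never can, which is why this blocker is architectural rather than a flag away from fixed.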
The performance impact is severe:
| Metric | DeepSeek-R1:14b | Gemma 4 26B-A4B |
|---|---|---|
| Generation throughput | ~42 tok/s | ~15.2 tok/s |
| CUDA graphs | piecewise | none |
| torch.compile | enabled | disabled |
| KV cache dtype | tq-t4nc (4-bit) | fp8_e4m3 (8-bit) |
| Speculative decoding | DFlash native | none |
| Attention backend | TRITON_ATTN | TRITON_ATTN |
15.2 tok/s is usable for reasoning (short, bursty requests), but it's a 2.8x throughput penalty compared to DeepSeek running the full TurboQuant pipeline. For interactive chat with reasoning-in-the-loop, that latency adds up.
What's Working
Despite the performance gap, the inference quality is correct and the infrastructure is solid:
- TurboQuant image: Gemma 4 runs on our aither-vllm-tq:latest Docker image. The TQ KV cache plugin is loaded (logs show "Registered TurboQuant CUSTOM backend"), just using fp8_e4m3 instead of the full tq-t4nc pipeline.
- Marlin MoE kernels: CompressedTensorsWNA16MarlinMoEMethod handles weight dequantization. 4-bit weights, 16-bit compute — fast matmuls even without CUDA graphs.
- TRITON_ATTN backend: Correct choice for heterogeneous head dimensions. Handles the 256/512 alternation natively.
- Hot-swap architecture: One docker compose stop aither-vllm-deepseek && docker compose up -d aither-vllm-gemma4 and the reasoning endpoint switches. The orchestrator doesn't care which model is behind port 8176.
- Tool calling: --enable-auto-tool-choice --tool-call-parser hermes gives Gemma 4 structured tool output, same as DeepSeek. Agent dispatch works unchanged.
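Because the endpoint is OpenAI-compatible, the orchestrator's tool-calling requests are unchanged across the swap. A sketch of such a request body — the tool schema and user message are invented for illustration; the alias, port, and parser come from the text above:

```python
import json

# Request the orchestrator would POST to
# http://aither-vllm-reasoning:8176/v1/chat/completions
payload = {
    "model": "cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit",
    "messages": [{"role": "user", "content": "What's on the calendar today?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "list_events",  # hypothetical tool, for illustration
            "description": "List today's calendar events",
            "parameters": {"type": "object", "properties": {}},
        },
    }],
    "tool_choice": "auto",   # let the hermes parser emit structured calls
}
body = json.dumps(payload)
```

Nothing in this payload names the backend, which is exactly why agent dispatch survives the hot swap untouched.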
The Optimization Roadmap
We have three levers to close the throughput gap, in order of confidence:
TQ tq-t4nc KV Cache (shipped in aither-kvcache v2.1.0): Switch --kv-cache-dtype fp8_e4m3 to tq-t4nc. The TurboQuant KV cache is a per-token encode/decode operation that hooks into TritonAttentionImpl.forward(), which Gemma 4 already uses via TRITON_ATTN.
The initial attempt crashed immediately:
NotImplementedError: The page size of the layer is not divisible by the maximum page size.
Cannot unify by adjusting block_size.
Root cause: three layers of uniform head_dim assumptions. The global _tq_quantizer singleton created a single quantizer for head_dim=256 (the first layer it saw), then tried to use it for head_dim=512 global attention layers. Page sizes computed per-layer weren't divisible by each other, failing vLLM's unify_kv_cache_spec_page_size(). Finally, hardcoded head_size // 2 expressions scattered through the encode/decode paths assumed that 4-bit packing always produces head_dim / 2 bytes, regardless of which quantizer instance handles a given layer.
The fix: replace the global singleton with a per-head-dim quantizer dict (_tq_quantizers[head_dim]), add power-of-2 page-size alignment for cross-layer divisibility, and derive packed dimensions from the actual layer's head size everywhere. Backward compatible — uniform models like DeepSeek see a dict with one entry instead of a scalar.
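The shape of that fix, as a sketch (names are ours for illustration, not the actual aither-kvcache source): one quantizer per head dim cached in a dict, and per-layer page sizes rounded up to a power of two so the 256- and 512-dim specs stay mutually divisible:

```python
class TQQuantizer:
    """Stand-in for the per-head-dim quantizer (illustrative, simplified)."""
    def __init__(self, head_dim: int):
        self.head_dim = head_dim
        # Packed size derived from THIS layer's head dim, never from
        # whichever layer happened to initialize a global singleton.
        self.packed_bytes = head_dim // 2   # 4-bit packing: 2 values/byte

_tq_quantizers: dict[int, TQQuantizer] = {}

def get_quantizer(head_dim: int) -> TQQuantizer:
    """One cached quantizer per head dim; uniform models get a single entry."""
    if head_dim not in _tq_quantizers:
        _tq_quantizers[head_dim] = TQQuantizer(head_dim)
    return _tq_quantizers[head_dim]

def align_page_size(nbytes: int) -> int:
    """Round a per-layer page size up to a power of two so pages from
    layers with different head dims divide one another cleanly."""
    size = 1
    while size < nbytes:
        size *= 2
    return size

local_pages = align_page_size(get_quantizer(256).packed_bytes)    # 128
global_pages = align_page_size(get_quantizer(512).packed_bytes)   # 256
assert global_pages % local_pages == 0  # cross-layer divisibility restored
```

For DeepSeek-style uniform models the dict holds one entry and behavior is unchanged, which is what makes the fix backward compatible.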
Piecewise CUDA Graphs (medium confidence): Replace --enforce-eager with --compilation-config {"cudagraph_mode":"piecewise"}. Piecewise graphs capture individual ops rather than the full forward pass, handling graph breaks gracefully. The question is whether vLLM's piecewise capture handles the layer-to-layer shape changes in MoE attention. If it works, this is a ~2x throughput improvement. If it fails, we keep eager mode and still get the TQ KV cache gains.
DFlash Speculative Decoding (high confidence once calibrated): Our DFlash native pipeline is model-architecture-agnostic — it trains a small draft head on the target model's hidden states. The existing dflash-calibrate-native.py script should work when pointed at the Gemma 4 model. Start conservative with 8 speculative tokens (vs 16 for DeepSeek) since MoE's diverse expert outputs may be harder for the draft head to predict.
Target: 30+ tok/s with TQ KV + piecewise graphs, potentially 40+ with DFlash.
Lessons Learned
vLLM's "supported models" page is aspirational, not factual. Gemma 4 is listed as supported. It is not, without manual transformers upgrades that break vLLM's own dependency constraints. Always test in a disposable container before committing to a model swap.
MoE weight sizes are deceptive. "4B active" means 4B compute, not 4B storage. All experts live in VRAM. Budget for the full parameter count, not the active count.
Always read config.json. Model cards say AWQ. Model configs say compressed-tensors. The config is truth; the card is marketing. cat config.json | grep quant before you write your --quantization flag.
Hot-swap architecture is worth building early. Three bugs, three fixes, zero downtime. The shared network alias pattern (aither-vllm-reasoning:8176) meant the orchestrator never knew we were debugging. Build the abstraction before you need it.
Performance isn't just model quality. Gemma 4's reasoning quality is competitive with DeepSeek-R1:14b. But at 15 tok/s vs 42 tok/s, the faster model wins in production even if the slower model is marginally better at reasoning. Throughput is a feature.
This is part of our series on building AitherOS, an AI-powered agent operating system. All inference runs on a single RTX 5090 with 32 GB VRAM.