Gemma 4 12B on the DGX Spark: Parallel Co-Residence and Perceive-Then-Reason
Two months ago we wrote about hot-swapping Google's Gemma 4 26B MoE onto a single GPU. That model was a reasoning challenger — text in, chain-of-thought out. On June 3rd, Google shipped something different enough that it broke our mental model of where a model lives in the stack: Gemma 4 12B, an encoder-free unified model.
Encoder-free means there's no separate vision tower bolted onto a language model. Image pixels, audio waveforms, and text tokens are all projected into the same embedding space and run through one decoder. The model doesn't call a vision model — it is one. 256K context, native audio and video, Apache 2.0, and small enough that Google's marketing says it "fits in 16 GB."
That last claim is true, and also a trap. Here's what it actually took to run it in AitherOS.
The serving reality nobody puts in the launch post
Three constraints surfaced before we wrote a line of config:
- It needs a nightly vLLM. The encoder-free 12B architecture landed in vLLM PR #44429, which merged into
mainon June 3rd — the same day Google shipped the model — and is not in the 0.22.0 stable release. We confirmed it: stockvllm/vllm-openai:v0.22.0listsGemma4ForConditionalGenerationbut noGemma4Unified; the nightly (0.22.1rc1.dev166) has it. The model'sconfig.jsondeclaresGemma4UnifiedForConditionalGeneration/model_type: gemma4_unifiedand is built againsttransformers 5.10— our productionaither-vllm-tqimage (vLLM 0.19.1) refuses it outright. You need a pinned nightly, and that transformers pin matters more than you'd think (see the AWQ section). - FP8 weight quantization produces gibberish. vLLM issue #39049 — dynamic FP8 on Gemma 4 emits garbage tokens. So FP8 is for the KV cache only; the weights stay bf16.
- There is no AWQ-4bit build yet. The community has AWQ for the 26B and 31B, but not the new 12B. bf16 weights mean ~24 GB resident, not 16.
That third point is the whole story. "Fits in 16 GB" assumes a quant that doesn't exist. At bf16 it's a 24 GB model, and 24 GB is a very awkward number on a 32 GB card.
The 32 GB arithmetic that doesn't close
Our desktop runs an RTX 5090 — 32 GB. The steady state holds three things that all want VRAM:
| Consumer | VRAM (bf16) |
|---|---|
| Orchestrator (Nemotron-8B, 4-bit) | ~8 GB |
| Gemma 4 12B (bf16) | ~24 GB |
| Sana / ComfyUI-Flux image gen | ~10–16 GB |
8 + 24 = 32. The orchestrator and Gemma alone saturate the card — before a single pixel of image generation. You cannot hold all three. This isn't a tuning problem; it's a wall.
We have two honest ways through it, and we're building both.
Path 1 (now): parallel co-residence on the DGX Spark
The DGX Spark (GB10) has 128 GB of unified memory. That changes everything. Instead of fighting over 32 GB, Gemma 4 12B and our flagship reasoner, Qwen3.6-27B, both stay resident with room to spare.
The only catch: vLLM reserves memory as a fraction of the device (--gpu-memory-utilization), and our Qwen primary was set to grab 0.65 — about 83 GB — to hoard KV cache for long-context jobs. We trimmed it to make a roommate:
| Model (port) | gpu_util | ≈ GB |
|---|---|---|
| Qwen3.6-27B primary (8120) | 0.50 | ~64 |
| Gemma 4 12B (8124) | 0.30 | ~38 (24 weights + ~14 KV) |
| nomic embeddings (8121) | 0.04 | ~5 |
| Total | 0.84 | ~107 / 128 |
Trimming Qwen from 0.65 to 0.50 costs almost nothing, because Qwen already runs TurboQuant TQ3.5 KV compression — its effective context is large even from a smaller slot. The win is that both models are hot: no loading, no swapping, no cold-start penalty on any request.
Why parallel, and not swap?
The obvious alternative is to share one big slot: sleep Gemma, wake Qwen, and vice versa. vLLM supports it (--enable-sleep-mode offloads weights to host RAM and reloads on demand), and AitherOS already orchestrates exactly this dance on the 5090 for image generation.
For this workload, swap is the wrong tool — and the reason is the pipeline shape.
The whole point of running Gemma alongside Qwen is a two-stage flow where one feeds the other on every multimodal turn. If the two models share a slot, every single image request pays a wake penalty: reloading 24–27 GB from host RAM is 10–20 seconds of dead air. Swap makes sense when two models serve disjoint workloads at different times. It's a disaster when one model's output is the next model's input.
So: parallel for the pipeline, swap as a fallback lever. We still enabled sleep mode on the Gemma container — not for steady state, but so we can temporarily put Gemma to sleep and hand Qwen the full ~85 GB for an occasional giant-context job, then wake it.
The pipeline: Gemma perceives so Qwen can reason
Here's the architectural insight that makes the parallel layout worth the effort.
Qwen3.6-27B is itself multimodal — it has vision. So today, when an image arrives, AitherOS routes it straight to Qwen, and Qwen does everything: it processes the raw pixels and reasons about them. That's our flagship reasoner spending its capacity — and a DGX round-trip — on pixel parsing.
But we now have a model whose entire identity is perception. So we split the work:
orchestrator → image → Gemma 4 (perceive) → orchestrator → Qwen 3.6 (reason)
- Stage 1 — perceive. Gemma 4 12B receives the image and the user's actual question, and returns question-conditioned visual context as text. Not a generic caption — "extract everything relevant to answering: «the user's question»." Fast, local to the DGX.
- Stage 2 — reason. We strip the image, inject Gemma's extraction as text, and hand it to Qwen3.6 for deep reasoning over that context plus everything else in the conversation. Qwen never touches a pixel.
Because both models live on the same DGX, the handoff is a localhost hop — not a network round-trip. Qwen gets to be a pure reasoner. Gemma gets to be the eyes and ears.
Where this lives in the code
AitherOS has exactly one choke point for multimodal routing — AitherLLMQueue._submit_internal. Today it's a single-stage decision: if the request has an image and the target model can see, keep it; otherwise redirect to a vision-capable model. The two-stage pipeline slots in around that decision, gated on effort: a casual "what's in this image?" still goes to Gemma alone (involving Qwen would just add latency), while a genuine reasoning task at effort ≥ 7 triggers perceive-then-reason. That gate is the difference between smart and slow.
The model registry already cooperates: model_supports_vision() recognizes any gemma4* name as multimodal, so the new gemma4-12b served-name is treated as a first-class perceiver, not mis-routed.
Eyes and ears: the same split for audio and video
We called Gemma "the eyes and ears" — then made it literal. The 12B is unified: raw audio waveforms project into the same embedding space as image patches, so the perceive→reason split generalizes from images to the whole perception stack.
One choke point, three modalities. _perceive_then_reason now detects image, audio, and video content parts and feeds them all to the same perceiver — honoring the model card's quirks:
- Modality order matters. The card specifies image before text, audio after text. A reorder step enforces it; get it wrong and quality drops.
- Audio is capped at 30 seconds. This is the hinge. Gemma does native ASR + speech translation for short, conversational clips — but a 45-minute meeting recording is not its job. So a duration probe routes anything over the cap to Whisper large-v3, which stays exactly where it was for long-form transcription, diarization, and word-level timestamps. Gemma didn't replace Whisper; it took the conversational front half and left the archival back half alone.
- Video is frames. Gemma has no video encoder — it reads video as a sequence of images (≤60s @ 1fps). A small PyAV sampler turns a clip into frames, which flow through the identical image path.
The output side never moved. Gemma is text-out only — it doesn't speak and it doesn't draw — so TTS (XTTS/Coqui) and image generation (Sana/ComfyUI) are untouched. The unification is purely on the input side: one model to perceive everything coming in, the right specialist for everything going out.
The same logic lives in two places now: the chat hot-path (AitherLLMQueue) for media in a conversation, and AitherVoice/AitherVision for explicit tool calls — both pointed at the same gemma4-12b backend, so there's one perceiver, not three implementations drifting apart.
The honest part: a unified perceiver is a single point of failure
Co-locating Gemma and Qwen on the DGX is great for the happy path — but it means the DGX is now a single point of failure for perception. Worth being clear-eyed about how it degrades:
- Perceiver down, reasoner up: the two-stage call fails, falls back to single-stage, and Qwen — itself multimodal — handles the image in one pass. You lose the optimization, not the answer.
- Reasoner down, perceiver up: this is the case the split improves. Because Gemma already turned the pixels into text, the reasoning step can land on any text model — the local orchestrator, a cloud reasoner — with no vision requirement. The handoff to text is exactly what makes the fallback flexible.
- Whole DGX down: here's the weak spot. Text reasoning still degrades to the always-resident 5090 orchestrator, but that orchestrator can't see. A vision request during a full DGX outage has nowhere local to go, and the text-only cloud fallback can't read the image.
The answer is the same AWQ track that frees the 5090: once an 8 GB Gemma runs locally next to the orchestrator, perception stops depending on one box. Cloud vision (Gemini, Vast.ai Gemma) is the other backstop. A unified perceiver is a powerful primitive — but "one model that sees and hears" is also "one model whose absence blinds you," and the deployment has to plan for both.
The roadmap: AWQ and AitherKVCache
Parallel-on-DGX is the production answer today. But the desktop 5090 is where most development happens, and we want Gemma there too. Two tracks bring it home:
AWQ-4bit (~8 GB). This is the unlock for the 5090: at 8 GB, orchestrator(8) + gemma-awq(8) + ComfyUI(16) = 32 finally co-resides with no sleep dance at all. No community quant exists for the 12B, so we set out to build our own — and the build is where the encoder-free novelty bites back. Four times.
The first wall is the toolchain. The obvious tool is llm-compressor — the modern, vLLM-blessed AWQ path — and that's what our first scaffold reached for. It can't touch this model: llm-compressor 0.11 pins transformers <= 4.57.6, but gemma4_unified only exists in transformers 5.10. Installing the quantizer silently downgrades transformers until the model won't even load (KeyError: 'gemma4_unified'). The model is one day old; the AWQ tooling hasn't caught up to the transformers release it needs. The tool that does span both is Intel's AutoRound (auto_round 0.12.3, which requires only transformers >= 4.38 — no upper cap — and ships an AutoRoundMLLM path for exactly this kind of model). There's already a published Gemma-4 W4A16 quant in the wild made with it. So: AutoRound, not llm-compressor.
The second wall is where to point it. Quantize the wrong modules and you either corrupt perception or quantize nothing at all. Instantiating the model on a meta device — no weights, no gated download, just the graph — shows the encoder-free layout plainly: the 48-layer language decoder lives at model.language_model.layers.* (328 Linear modules, the entire reasoning core), while perception is three small fused projectors — model.embed_vision.patch_dense, model.embed_vision.multimodal_embedder.embedding_projection, and model.embed_audio.embedding_projection. We quantize the decoder to 4-bit and keep those three projectors plus lm_head in bf16; the multimodal towers are tiny and AWQ kernels are unproven on them, so they're not worth the risk. Note the trap: a naive recipe targeting model.layers.* — the path every other model uses — matches nothing here, because the decoder is one level deeper, nested under language_model. You only catch that by reading the actual module tree before you spend a GPU-hour.
The third wall is calibration itself. AutoRound's flagship mode learns per-layer rounding by running calibration data through each decoder block — and that forward pass crashes on Gemma 4: RuntimeError: tensor a (512) must match tensor b (256) inside apply_rotary_pos_emb. Gemma 4 alternates attention head dimensions (256 and 512) across its sliding-window and global layers; AutoRound caches one block's rotary embeddings and reuses them for a block with a different head dim. The model runs fine in vLLM — it's the calibration harness that can't cope with the heterogeneous heads, the same 256/512 thorn the 26B MoE hit. The escape hatch is RTN mode (--iters 0): pure weight-rounding, no calibration forward, so it sidesteps the crash entirely. You trade a point or two of quality for a quant that actually completes — all 48 layers in under a minute, peak VRAM under 2 GB.
The fourth wall is the export format. AutoRound's auto_awq export wraps even the ignored projectors as AWQ-packed tensors (qweight/qzeros), and vLLM's day-old gemma4_unified loader expects those multimodal projectors unquantized — so the server refuses to start with no parameter named 'vision_embedder.patch_dense.weight' (note vLLM even uses a different prefix, vision_embedder, than the checkpoint's model.embed_vision). The fix is to re-export in compressed-tensors — the same format the 26B AWQ-Gemma already serves on — which keeps ignored modules as plain bf16 .weight so vLLM's name-mapper can line them up.
Four walls, every one a consequence of the model being days old and the quant-and-serve toolchain still catching up. The decoder quant itself was never the hard part; the plumbing around a brand-new architecture was. Run it on a rented GPU so the desktop stays clean — and keep the recipe (RTN + compressed-tensors + the model.language_model.layers targeting) in version control, because every one of those four facts cost a GPU-hour to learn.
AitherKVCache / TurboQuant for Gemma. Qwen's DGX slot already runs TQ3.5 KV compression. Teaching those same hooks to compress Gemma 4 12B's KV cache — on the nightly vLLM, against its sliding-window-plus-global attention layout — makes 256K context nearly free and the parallel layout even roomier. That's active R&D, but it's the multiplier.
Until both land, the DGX serves bf16 weights + FP8 KV cache — exactly what's wired today.
What shipped
The configuration is in the tree: an opt-in local-gemma4-12b model-stack, a 5090 service (aither-vllm-gemma4-12b, port 8178, time-sliced with sleep mode) and a DGX service (aither-vllm-dgx-gemma4-12b, port 8124, parallel with Qwen), a dedicated vllm_gemma4_12b backend, the registry entry, the utilization rebalance, and the AWQ quantization scaffold (now built on AutoRound, not llm-compressor). The 4-bit weights themselves are the one piece still in flight — blocked not on compute but on finding a quantizer new enough to read a model this fresh. The two-stage perceive→reason path is wired and unit-tested across image, audio (with the 30s→Whisper guard), and video (frame sampling); AitherVoice and AitherVision route to the same perceiver. The default routing points at the DGX, because that's where this pipeline wants to live.
The lesson that keeps repeating with frontier open models: the launch post tells you what the model can do. The VRAM arithmetic tells you where it can live. A single unified model that sees and hears and reasons is genuinely new — but on real hardware, "it fits in 16 GB" still has to survive a 32 GB card with two other tenants already home.