One GPU, Four Models: How AitherOS Juggles 32 GB of VRAM
April 16, 2026 -- David Parkhurst
The RTX 5090 has 32 GB of VRAM. Here is what AitherOS runs on it:
| Model | Quant | Purpose | GPU Util | VRAM Budget |
|---|---|---|---|---|
| Nemotron-Orchestrator-8B | AWQ 4-bit + TQ4 KV | Chat, routing, tool use, reflex | 40% | ~12.8 GB |
| DeepSeek-R1-distill-Qwen-14B | AWQ 4-bit + TQ4 KV | Reasoning (effort 7--10) | 50% | ~16 GB |
| nomic-embed-text-v1.5 | FP16 | Embeddings (CodeGraph, semantic search) | 5% | ~300 MB |
| SDXL / Illustrious / Flux | Native weights | Image generation (ComfyUI) | dynamic | 6--20 GB |
Three vLLM instances share the GPU: orchestrator at 40%, reasoning at 50%, embeddings at 5%. That is 95% allocated before ComfyUI even thinks about loading a diffusion model. And Gemma 4 -- the 26B MoE reasoning/vision model waiting in the wings as a hot-swap alternative to DeepSeek -- wants 55% by itself.
The naive solution is to pick one model and accept the constraints. We chose the other path: run them all, but coordinate who is awake and who is asleep, with a scheduler that reclaims VRAM on demand. The result is a system where users can chat, reason through complex problems, analyze images, and generate images -- on one consumer GPU -- without manually restarting anything.
This is the story of how we built that, and what we learned when a 150-second silent image generation taught us that coordination without observability is just coordination nobody trusts.
The Cast: What Actually Runs
1. Nemotron-Orchestrator-8B (vLLM TQ, port 8199)
The orchestrator. Every chat message, every agent dispatch, every tool call goes through this model. It is NVIDIA's Nemotron-Orchestrator-8B, quantized to AWQ 4-bit (cyankiwi/Nemotron-Orchestrator-8B-AWQ-4bit), served by our TurboQuant-patched vLLM fork.
The weights are ~4.5 GB in AWQ 4-bit. The KV cache uses TQ4 (TurboQuant 4-bit vector quantization), which compresses attention state to roughly 34 KB per token instead of the 128 KB you would get with FP16. At 40% GPU utilization, the orchestrator gets ~12.8 GB total -- enough for weights plus a TQ4 KV cache that can hold tens of thousands of tokens.
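As a back-of-envelope check on those figures, the sketch below divides what is left under the 40% ceiling, after weights, by the per-token KV cost. It is only an upper bound: activations, CUDA graph buffers, and allocator overhead are ignored, so real capacity will be noticeably lower. The function name and the GiB/KiB conversions are assumptions for illustration.

```python
GIB = 1024 ** 3
KIB = 1024

def kv_token_capacity(total_vram_gb, gpu_util, weights_gb, kv_bytes_per_token):
    """Upper bound on tokens that fit in the KV cache under a utilization ceiling."""
    budget = total_vram_gb * gpu_util * GIB      # bytes the instance may use
    kv_budget = budget - weights_gb * GIB        # bytes left once weights are loaded
    return int(kv_budget // kv_bytes_per_token)

# Orchestrator figures from the article: 32 GB card, 40% ceiling, ~4.5 GB weights.
tq4 = kv_token_capacity(32, 0.40, 4.5, 34 * KIB)     # TQ4 KV cache
fp16 = kv_token_capacity(32, 0.40, 4.5, 128 * KIB)   # FP16 KV cache
print(f"TQ4 upper bound:  {tq4:,} tokens")
print(f"FP16 upper bound: {fp16:,} tokens")
print(f"ratio: {tq4 / fp16:.2f}x")
```

The ratio is the useful number here: at 34 KB versus 128 KB per token, the same budget holds a bit under four times as many tokens.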
vLLM runs with --enable-auto-tool-choice --tool-call-parser hermes so the model handles structured tool calls natively. CUDA graphs are enabled via piecewise compilation (cudagraph_mode: piecewise), which gives us the graph-execution speedup without the rigid batch-size constraints of full CUDA graph capture.
This model never sleeps. It is always resident. Everything else works around it. It also handles effort 1--2 "reflex" tasks directly -- simple greetings, lookups, quick answers -- so there is no separate reflex model. One model, all chat tiers.
2. DeepSeek-R1-distill-Qwen-14B (vLLM TQ, port 8176)
The reasoning model. Handles effort 7--10 tasks: multi-step code refactoring, deep analysis, chain-of-thought problems. It is casperhansen/deepseek-r1-distill-qwen-14b-awq -- a 14B distillation of DeepSeek-R1 into the Qwen architecture, AWQ 4-bit quantized.
This runs locally on the same GPU, in a second vLLM instance (aither-vllm-deepseek), also with TQ4 KV cache. At 50% GPU utilization, it gets ~16 GB. The weights are ~7.5 GB in AWQ 4-bit, leaving the rest for KV cache.
The critical feature: sleep mode. The container runs with VLLM_SERVER_DEV_MODE=1, which exposes a POST /sleep endpoint. When MicroScheduler needs VRAM for image generation, it puts the reasoning model to sleep. vLLM offloads the model weights to DDR5 system RAM (~128 GB available). The VRAM is freed. When reasoning is needed again, weights reload from DDR5 in 5--10 seconds -- dramatically faster than the 30--60 second cold start from disk.
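A minimal sketch of what the sleep and wake calls could look like from a scheduler's side. Only POST /sleep and port 8176 come from the text above; the host, the level query parameter, and the /wake_up path follow upstream vLLM's dev-mode server and are assumptions here, since the TurboQuant fork may differ.

```python
import urllib.request

REASONING_URL = "http://localhost:8176"  # assumed host; port is from the article

def sleep_request(base_url=REASONING_URL, level=1):
    """Build the POST that asks vLLM to offload weights to system RAM.

    level=1 follows upstream vLLM semantics (weights kept in CPU RAM);
    treat it as an assumption for a patched fork.
    """
    return urllib.request.Request(f"{base_url}/sleep?level={level}", method="POST")

def wake_request(base_url=REASONING_URL):
    """Build the POST that asks vLLM to reload weights into VRAM (assumed path)."""
    return urllib.request.Request(f"{base_url}/wake_up", method="POST")

def send(req, timeout=30.0):
    """Send a prepared request and return the HTTP status code."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Calling send(sleep_request()) before a heavy image job and send(wake_request()) afterwards would mirror the sequence described above; building the requests separately keeps the URLs easy to inspect without a running server.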
3. Gemma 4 26B-A4B (hot-swap alternative, port 8176)
The wildcard. Google's Gemma 4 is a 26B Mixture-of-Experts model with only 4B active parameters per token (cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit). It does both reasoning and vision -- multimodal in a single model. It is the hot-swap replacement for DeepSeek-R1 on the reasoning port.
Swapping is manual today:
docker compose stop aither-vllm-deepseek
docker compose up -d aither-vllm-gemma4
Both containers bind to the same aither-vllm-reasoning network alias on port 8176, so Genesis does not care which one is running. The served model name changes (deepseek-r1:14b vs gemma4-reasoning), but the routing is port-based.
Gemma 4 gets 55% GPU utilization with fp8_e4m3 KV cache (TQ4 is blocked by heterogeneous attention head sizes -- 256/512 -- that break vLLM's page-size unification). It needs transformers>=5.5.0, which we pip-install at container startup because vLLM 0.19 pins an older version.
When Gemma 4 is active, you get vision and reasoning from one model slot. No separate vision container needed.
4. ComfyUI (port 8188)
The image generation backend. ComfyUI manages its own model loading:
| Tier | Model | Steps | VRAM |
|---|---|---|---|
| Lightning | sdxl_lightning_4step | 4 | ~6 GB |
| Quality | waiIllustriousSDXL_v140 | 20--30 | ~12 GB |
| Ultra | flux1-dev-fp8 | 30--50 | ~20 GB |
ComfyUI has an internal LRU cache that keeps recently-used models in VRAM. Great for repeated generations. Catastrophic when the orchestrator and reasoning model already occupy 90% of the card.
5. nomic-embed-text-v1.5 (vLLM, port 8209)
A third vLLM instance dedicated to embeddings. nomic-ai/nomic-embed-text-v1.5 in FP16, 768-dimensional vectors, at just 5% GPU utilization (~300 MB VRAM). It serves CodeGraph, semantic search, the memory graph, and 28+ callers across all layers. Runs on stock vLLM (no TQ patch needed) with --max-num-seqs 64 for high-throughput batch embedding.
This replaced the old Ollama /api/embed path -- faster batching, no Ollama dependency, and the 5% GPU allocation is small enough to coexist with everything else.
The Budget: How 32 GB Gets Divided
Here is the actual VRAM allocation from .env and .DEPLOYMENT/compose/docker-compose.aitheros.yml:
Orchestrator (vLLM TQ):  40% = 12.8 GB  (always resident)
Reasoning (vLLM TQ):     50% = 16.0 GB  (sleepable to DDR5)
Embeddings (vLLM):        5% =  1.6 GB  (~300 MB actual)
                         ───────────────
Subtotal:                95% = 30.4 GB
System reserved:                4 GB    (CUDA overhead)
ComfyUI:                        0 GB    allocated (steals from sleeping models)
The math does not add up. 95% + 4 GB = 34.4 GB on a 32 GB card. This works because:
- vLLM GPU utilization is a ceiling, not a reservation. gpu-memory-utilization 0.40 means "use up to 40%." At idle with no active sequences, actual usage is just the model weights (~4.5 GB for orchestrator, ~7.5 GB for reasoning, ~300 MB for embeddings).
- Reasoning sleeps. When ComfyUI needs VRAM, reasoning offloads weights to DDR5. Its 16 GB allocation goes to zero.
- Embeddings are tiny. nomic-embed-text at 5% uses ~300 MB in practice. The 1.6 GB ceiling is never reached.
The steady-state during chat: orchestrator weights (~4.5 GB) + orchestrator KV cache (dynamic) + reasoning weights (~7.5 GB) + reasoning KV cache (dynamic) + embeddings (~300 MB) ≈ 13--20 GB depending on active sequences. Plenty of headroom for Lightning-tier image generation without sleeping anything.
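The arithmetic behind that claim can be written down in a few lines. This is a sketch using only the figures quoted above; the idle footprint deliberately ignores KV cache and CUDA overhead, and the dictionary layout and function names are illustrative.

```python
CARD_GB = 32.0

# name: (utilization ceiling, approximate idle footprint in GB), from the article
INSTANCES = {
    "orchestrator": (0.40, 4.5),
    "reasoning": (0.50, 7.5),
    "embeddings": (0.05, 0.3),
}

def ceilings_gb(instances=INSTANCES, card_gb=CARD_GB):
    """Per-instance VRAM ceiling implied by each utilization fraction."""
    return {name: util * card_gb for name, (util, _) in instances.items()}

def idle_footprint_gb(instances=INSTANCES, asleep=()):
    """Weights-only footprint of every instance that is not asleep."""
    return sum(idle for name, (_, idle) in instances.items() if name not in asleep)

total_ceiling = sum(ceilings_gb().values())
print(f"sum of ceilings:        {total_ceiling:.1f} GB")
print(f"idle, all awake:        {idle_footprint_gb():.1f} GB")
print(f"idle, reasoning asleep: {idle_footprint_gb(asleep={'reasoning'}):.1f} GB")
```

The gap between the first line and the other two is the headroom ComfyUI works with.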
MicroScheduler: The VRAM Traffic Cop
MicroScheduler (port 8150) coordinates VRAM across processes. Every image generation request goes through it. It runs a graduated VRAM reclamation sequence based on the image tier:
Quick Mode (Lightning/Turbo)
SDXL Lightning needs ~6 GB. MicroScheduler checks free VRAM. If the orchestrator and reasoning model have low KV cache occupancy, there is enough headroom already. Slot granted immediately. No models evicted. Zero latency tax.
Standard Mode (Quality)
Full Illustrious XL needs ~12 GB. MicroScheduler sends a sleep signal to the reasoning container. vLLM offloads DeepSeek-R1 weights from VRAM to DDR5 RAM. This frees ~7.5 GB of weights plus whatever KV cache was resident. Combined with existing headroom, that is enough for Illustrious alongside the orchestrator.
Full Mode (Ultra/Flux)
Flux needs ~20 GB. MicroScheduler sleeps reasoning and signals the orchestrator to drop its KV cache to minimum. This clears roughly 20+ GB -- enough for Flux with room for ComfyUI overhead. The orchestrator weights stay resident (they are only 4.5 GB) and the embedding model is negligible (~300 MB), so chat and search still work during generation, just without cached context.
The key insight: most image generations are Lightning tier. A user asking "draw me a cat" does not need Flux. Quick mode adds zero latency. Only the heavy tiers pay the model-swap cost. And when they do, the user sees exactly what is happening thanks to the streaming pipeline.
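The three modes can be summarized as a small decision function. This is a sketch of the policy as described, not MicroScheduler's actual code: the names are hypothetical, and escalating a quick request to standard when headroom is missing is an assumption the text does not spell out.

```python
from dataclasses import dataclass

# Approximate VRAM needs and modes per image tier, from the article.
TIER_NEED_GB = {"lightning": 6.0, "quality": 12.0, "ultra": 20.0}
TIER_MODE = {"lightning": "quick", "quality": "standard", "ultra": "full"}

@dataclass
class Plan:
    mode: str
    sleep_reasoning: bool          # offload the reasoning model to DDR5
    shrink_orchestrator_kv: bool   # drop the orchestrator KV cache to minimum

def plan_reclaim(tier: str, free_gb: float) -> Plan:
    """Pick the least disruptive reclamation plan for an image tier."""
    mode = TIER_MODE[tier]
    if mode == "quick" and free_gb < TIER_NEED_GB[tier]:
        mode = "standard"  # assumption: escalate when headroom is missing
    return Plan(
        mode=mode,
        sleep_reasoning=mode in ("standard", "full"),
        shrink_orchestrator_kv=mode == "full",
    )

print(plan_reclaim("lightning", free_gb=10.0))
print(plan_reclaim("ultra", free_gb=10.0))
```

The orchestrator's weights never appear in the plan, which matches the rule that it is always resident.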
Lazy Loading and Hot-Swap
When ComfyUI gets a generation request, it may need to load a diffusion model from disk. First-load times from NVMe:
| Model | Cold Load |
|---|---|
| SDXL Lightning 4-step | ~4 seconds |
| Illustrious XL v1.4 | ~8 seconds |
| Flux Dev FP8 | ~18 seconds |
After first load, ComfyUI's LRU cache keeps the model resident. Subsequent generations in the same tier are near-instant. But switching model families (SDXL → Flux) requires an explicit flush:
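def _hot_swap_if_needed(self, model_name: str):
    new_family = self._classify_model_family(model_name)
    # Flush only when a different family is already resident.
    if self._current_model_family and self._current_model_family != new_family:
        # ComfyUI's /free route unloads cached models and releases their VRAM.
        self._post("/free", json={"unload_models": True, "free_memory": True})
    # Always record the family, including on the very first load.
    self._current_model_family = new_family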
Without the explicit flush, ComfyUI's LRU tries to keep both model families resident simultaneously. On a card where the orchestrator already has 4.5 GB of weights, there is no room for SDXL + Flux. The flush prevents the OOM.
Reasoning model reload is similar but faster. vLLM sleep mode keeps weights in DDR5 RAM (128 GB available). Reload from DDR5 takes 5--10 seconds. Cold start from disk would be 30--60 seconds. The DDR5 cache is the difference between "brief pause" and "feels broken."
The 150-Second Black Hole (And How We Fixed It)
Here is what the user saw before today:
Pipeline
[done] Classify intent
[done] Preparing generation 0.1s
[done] Loading model to GPU 0.1s
[active] Generating image 154s...
Three events in 100 milliseconds. Then silence for two and a half minutes. The chat engine was emitting fake progress events before calling the Canvas service. The actual work -- VRAM sleep, ComfyUI health check, safety filtering, prompt enhancement, model cold load, diffusion sampling -- happened inside a single blocking HTTP POST. Zero intermediate feedback.
The Fix: Streaming Progress via NDJSON
We added /generate/stream to the Canvas service. It returns NDJSON with a progress event at every phase. This is real output from a Lightning generation:
{"phase":"health_check", "elapsed_s":0.0, "detail":"Checking ComfyUI status"}
{"phase":"vram_acquire", "elapsed_s":0.0, "detail":"Requesting VRAM slot (mode=quick)"}
{"phase":"vram_acquire", "elapsed_s":0.1, "detail":"VRAM acquired: 1127MB free (mode=quick)"}
{"phase":"safety_filter", "elapsed_s":0.1, "detail":"Safety: unrestricted"}
{"phase":"prompt_enhance", "elapsed_s":0.1, "detail":"Skipped (fast tier)"}
{"phase":"model_swap", "elapsed_s":0.1, "detail":"Loading sdxl"}
{"phase":"workflow_build", "elapsed_s":0.1, "detail":"Building lightning workflow (1024x1024)"}
{"phase":"queued", "elapsed_s":0.1, "detail":"Submitted to ComfyUI"}
{"phase":"sampling", "elapsed_s":1.8, "step":1, "total":4, "percent":25}
{"phase":"sampling", "elapsed_s":2.0, "step":3, "total":4, "percent":75}
{"phase":"sampling", "elapsed_s":2.1, "step":4, "total":4, "percent":100}
{"phase":"encoding", "elapsed_s":3.1, "detail":"Encoding images to base64"}
{"phase":"done", "elapsed_s":3.1, "detail":"1 image(s) generated"}
The Canvas service wraps its entire pipeline in an async generator. Each phase yields a JSON line before starting the next operation. The CanvasClient streams these and forwards each to the chat engine's SSE callback, which pushes it to the frontend in real time.
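A stripped-down sketch of that pattern: an async generator that yields one NDJSON line before each phase runs. The phase tuples and the no-op work function are placeholders, not the Canvas service's real pipeline.

```python
import asyncio
import json
import time

async def generate_stream(phases):
    """Yield one NDJSON line per phase, before running that phase's work."""
    start = time.monotonic()

    def line(phase, detail):
        elapsed = round(time.monotonic() - start, 1)
        return json.dumps({"phase": phase, "elapsed_s": elapsed, "detail": detail}) + "\n"

    for name, detail, work in phases:
        yield line(name, detail)   # the client sees the phase before it starts
        await work()               # the actual (possibly slow) operation
    yield line("done", "finished")

async def _noop():
    # Placeholder for real work such as a health check or VRAM request.
    await asyncio.sleep(0)

async def collect():
    phases = [
        ("health_check", "Checking ComfyUI status", _noop),
        ("vram_acquire", "Requesting VRAM slot (mode=quick)", _noop),
    ]
    return [ndjson async for ndjson in generate_stream(phases)]

lines = asyncio.run(collect())
print("".join(lines), end="")
```

Because each line is emitted before the work it describes, a slow phase shows up as a visible "currently doing X" rather than as silence.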
ComfyUI sends progress messages over WebSocket with {value, max} for each sampling step. We bridge sync WebSocket to async streaming via a thread + queue:
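# Thread: WebSocket → queue
if msg_type == 'progress':
    progress_q.put({"phase": "sampling", "step": val, "total": mx,
                    "percent": round(val / mx * 100)})

# Async generator: queue → NDJSON (needs `import queue` for queue.Empty)
last = 0  # last percentage that was emitted
while not fut.done():
    try:
        msg = progress_q.get(timeout=0.5)
    except queue.Empty:
        continue  # nothing new yet; re-check whether the job has finished
    pct = msg.get("percent", 0)
    if msg["phase"] == "sampling" and pct - last >= 10:
        last = pct
        yield emit("sampling", **msg)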
Sampling events are throttled to at most one per 10% of progress, since a 50-step quality generation does not need 50 SSE events.
For a quality-tier cold-start generation, the user now sees:
[0.0s] Checking ComfyUI status
[0.1s] Requesting VRAM slot (mode=standard)
[3.2s] Sleeping reasoning model...
[8.1s] VRAM acquired: 18432MB free
[8.2s] Checking content safety
[8.3s] Enhancing prompt with LLM
[12.1s] Building quality workflow (1024x1024)
[12.2s] Submitted to ComfyUI
[20.5s] Sampling 10%
[22.3s] Sampling 50%
[24.1s] Sampling 90%
[25.8s] Encoding images
[26.0s] 1 image generated
Every second accounted for. No black holes.
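On the consuming side, turning each NDJSON line into one of those timeline entries takes very little code. A minimal sketch, assuming the event fields shown in the Lightning output earlier; the function name is illustrative.

```python
import json

def render_event(line: str) -> str:
    """Format one NDJSON progress event as a timeline entry."""
    event = json.loads(line)
    if event["phase"] == "sampling":
        text = f"Sampling {event['percent']}%"
    else:
        # Fall back to the phase name if an event carries no detail.
        text = event.get("detail", event["phase"])
    return f"[{event['elapsed_s']:.1f}s] {text}"

sample = [
    '{"phase":"health_check", "elapsed_s":0.0, "detail":"Checking ComfyUI status"}',
    '{"phase":"sampling", "elapsed_s":1.8, "step":1, "total":4, "percent":25}',
]
for line in sample:
    print(render_event(line))
# [0.0s] Checking ComfyUI status
# [1.8s] Sampling 25%
```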
The Effort Router
Not every request needs the same model. The EffortScaler in AgentForge classifies each request before it touches the GPU:
| Effort | Model | Where | Use Case |
|---|---|---|---|
| 1--6 | Nemotron-Orchestrator-8B | vLLM TQ :8199 | Chat, reflex, agents, tool calling |
| 7--10 | DeepSeek-R1-14B or Gemma 4 | vLLM TQ :8176 | Reasoning, complex analysis |
| Vision | Gemma 4 26B-A4B | vLLM :8176 | Image understanding (multimodal) |
| Embed | nomic-embed-text-v1.5 | vLLM :8209 | Semantic search, CodeGraph, RAG |
| Image | SDXL / Illustrious / Flux | ComfyUI :8188 | Image generation |
The orchestrator handles 80%+ of all calls -- from simple greetings (effort 1) through full agent dispatch with tool calling (effort 6). There is no separate reflex model; the orchestrator is fast enough at AWQ 4-bit with TQ4 KV to handle everything below effort 7. When a task needs reasoning, the orchestrator invokes the deep_reasoning tool, which routes to the reasoning model on port 8176. The orchestrator stays in the loop as dispatcher; the reasoning model does the heavy thinking.
Gemma 4 is the interesting one. As a multimodal MoE model, it handles both vision and reasoning in a single model slot. When it is active on port 8176 (instead of DeepSeek-R1), image understanding comes for free -- no separate vision container, no VRAM zone swap. The trade-off is that DeepSeek-R1 is a stronger pure-reasoning model, so the choice depends on workload. Heavy code analysis? DeepSeek. Mixed vision + reasoning? Gemma 4.
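The routing table reduces to a lookup keyed on request kind and effort. The sketch below flattens the real flow, where the orchestrator reaches the reasoning model through the deep_reasoning tool, into a single function; the Route type and the kind strings are hypothetical, and only the ports and effort thresholds come from the table.

```python
from typing import NamedTuple

class Route(NamedTuple):
    model: str
    port: int

ORCHESTRATOR = Route("Nemotron-Orchestrator-8B", 8199)
REASONING = Route("reasoning slot (DeepSeek-R1-14B or Gemma 4)", 8176)
EMBEDDINGS = Route("nomic-embed-text-v1.5", 8209)
IMAGE = Route("ComfyUI", 8188)

def route(kind: str, effort: int = 1) -> Route:
    """Map a request to a backend using the thresholds from the table."""
    if kind == "embed":
        return EMBEDDINGS
    if kind == "image":
        return IMAGE
    if kind == "vision":
        return REASONING  # only meaningful while Gemma 4 holds the slot
    if not 1 <= effort <= 10:
        raise ValueError(f"effort must be 1-10, got {effort}")
    return ORCHESTRATOR if effort <= 6 else REASONING

print(route("chat", effort=3))
print(route("chat", effort=8))
```

Because both reasoning containers answer on port 8176, the function does not need to know which one is currently running.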
The Numbers
On the RTX 5090 with all capabilities available:
| Metric | Value |
|---|---|
| Orchestrator (Nemotron-8B AWQ 4-bit) | ~42 tok/s (TQ4 KV) |
| Orchestrator weights | ~4.5 GB |
| Reasoning (DeepSeek-R1-14B AWQ 4-bit) | ~30 tok/s (TQ4 KV) |
| Reasoning weights | ~7.5 GB |
| Reasoning sleep → DDR5 offload | ~2 seconds |
| Reasoning wake from DDR5 | 5--10 seconds |
| Lightning image gen (cached model) | 2--4 seconds |
| Quality image gen (cold model load) | 15--30 seconds |
| Ultra image gen (Flux + VRAM sleep) | 45--120 seconds |
| VRAM swap latency | 0s quick / 5s standard / 10--15s full |
The critical insight: most requests never trigger a VRAM swap. Chat stays on the always-resident orchestrator. Reasoning runs on its own 50% allocation. Lightning images fit in the headroom between actual usage and allocation ceiling. Only quality/ultra generation and model-family switches require actual juggling. The graduated preemption means the common case is free and the expensive case is rare.
What We Learned
1. AWQ 4-bit + TQ4 KV cache is the enabler. Without 4-bit quantization, an 8B model takes ~16 GB in FP16. A 14B model takes ~28 GB. You cannot run both on 32 GB, let alone leave room for image generation. AWQ 4-bit cuts weights to ~4.5 GB and ~7.5 GB respectively. TurboQuant cuts KV cache by 4x. The compression is what makes the entire juggling act possible.
2. Sleep mode is the secret weapon. Binary eviction (load/unload from disk) is too slow. DDR5 offload via vLLM sleep mode takes 2 seconds down and 5--10 seconds up. With 128 GB of system RAM, we can keep multiple models "warm" in DDR5 even when they are not in VRAM. It turns a 60-second cold start into a 5-second warm resume.
3. Observability is not optional for long operations. A 3-second Lightning generation does not need a progress bar. A 120-second Flux generation absolutely does. The streaming pipeline was the difference between "the system is broken" and "the system is sleeping the reasoning model, currently at 60% sampling."
4. Graduated eviction beats binary eviction. Our first implementation had two modes: everything loaded, or everything evicted. Every image generation, even Lightning, triggered a full VRAM flush. A 2-second generation had a 15-second eviction/reload tax. The three-tier approach (quick/standard/full) eliminated the tax for 80% of requests.
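5. MoE models change the game. Gemma 4's 26B total / 4B active architecture means you get a "26B-class" model at roughly the per-token compute cost of a 4B dense model. The VRAM cost is not 4B-class, though: all of the expert weights still have to be resident, and Gemma 4 takes 55% of the card in our setup. And because it is multimodal, one model slot serves both reasoning and vision. The per-model-slot capability density is going up fast. The juggling act gets easier when each model does more.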
Architecture
               User Request
                     │
                     ▼
        ┌──────────────────────────┐
        │ Chat Engine              │  Classifies intent + effort
        │ (Genesis:8001)           │
        └────────────┬─────────────┘
                     │
       ┌─────────────┼───────────────┐
       │             │               │
       ▼             ▼               ▼
┌────────────┐  ┌──────────┐  ┌──────────────┐
│ E 1-6      │  │ E 7-10   │  │ Image Gen    │
│ Nemotron   │  │ DeepSeek │  │ Canvas:8108  │
│ Orch 8B    │  │ R1 14B   │  └──────┬───────┘
│ AWQ 4-bit  │  │ AWQ 4-bit│         │
│ TQ4 KV     │  │ TQ4 KV   │  ┌──────┴───────┐
│ vLLM:8199  │  │ vLLM:8176│  │MicroScheduler│
│ 40% GPU    │  │ 50% GPU  │  │ :8150        │
└────────────┘  │ (sleeps  │  │ VRAM slot    │
                │ to DDR5) │  │quick/std/full│
                └──────────┘  └──────┬───────┘
                     OR              │
                ┌──────────┐  ┌──────┴───────┐
                │ Gemma 4  │  │ ComfyUI:8188 │
                │ 26B MoE  │  │ SDXL / Flux  │
                │ AWQ 4-bit│  │ WS progress  │
                │ Vision + │  └──────────────┘
                │ Reasoning│
                │ vLLM:8176│
                │ 55% GPU  │
                └──────────┘

┌──────────────────────┐
│ RTX 5090 (32 GB)     │
│ ┌──────┐ ┌─────────┐ │
│ │ Orch │ │Reasoning│ │
│ │ 40%  │ │ 50%     │ │
│ │always│ │sleepable│ │
│ └──────┘ └─────────┘ │
│ ComfyUI steals from  │
│ sleeping models      │
└──────────────────────┘
The entire system runs on a single workstation: AMD Ryzen 9 9950X3D (16 cores), 128 GB DDR5, RTX 5090 (32 GB). No cloud dependency. Three vLLM instances share the GPU, with sleep-mode coordination for the reasoning slot. ComfyUI gets VRAM by putting models to bed. Every request gets the right model, at the right time, with the right amount of VRAM -- and now, with real-time visibility into every phase of the process.
The code is open. The streaming pipeline shipped today.