One GPU, Four Models: How AitherOS Juggles 32 GB of VRAM

April 16, 2026 -- 14 min read -- David Parkhurst

The RTX 5090 has 32 GB of VRAM. Here is what AitherOS runs on it:

Model	Quant	Purpose	GPU Util	VRAM Budget
Nemotron-Orchestrator-8B	AWQ 4-bit + TQ4 KV	Chat, routing, tool use, reflex	40%	~12.8 GB
DeepSeek-R1-distill-Qwen-14B	AWQ 4-bit + TQ4 KV	Reasoning (effort 7--10)	50%	~16 GB
nomic-embed-text-v1.5	FP16	Embeddings (CodeGraph, semantic search)	5%	~300 MB
SDXL / Illustrious / Flux	Native weights	Image generation (ComfyUI)	dynamic	6--20 GB

Three vLLM instances share the GPU: orchestrator at 40%, reasoning at 50%, embeddings at 5%. That is 95% allocated before ComfyUI even thinks about loading a diffusion model. And Gemma 4 -- the 26B MoE reasoning/vision model waiting in the wings as a hot-swap alternative to DeepSeek -- wants 55% by itself.

The naive solution is to pick one model and accept the constraints. We chose the other path: run them all, but coordinate who is awake and who is asleep, with a scheduler that reclaims VRAM on demand. The result is a system where users can chat, reason through complex problems, analyze images, and generate images -- on one consumer GPU -- without manually restarting anything.

This is the story of how we built that, and of how a 150-second silent image generation taught us that coordination without observability is just coordination nobody trusts.

The Cast: What Actually Runs

1. Nemotron-Orchestrator-8B (vLLM TQ, port 8199)

The orchestrator. Every chat message, every agent dispatch, every tool call goes through this model. It is NVIDIA's Nemotron-Orchestrator-8B, quantized to AWQ 4-bit (cyankiwi/Nemotron-Orchestrator-8B-AWQ-4bit), served by our TurboQuant-patched vLLM fork.

The weights are ~4.5 GB in AWQ 4-bit. The KV cache uses TQ4 (TurboQuant 4-bit vector quantization), which compresses attention state to roughly 34 KB per token instead of the 128 KB you would get with FP16. At 40% GPU utilization, the orchestrator gets ~12.8 GB total -- enough for weights plus a TQ4 KV cache that can hold tens of thousands of tokens.
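
As a rough sanity check on those figures, the sketch below estimates an upper bound on how many tokens fit in the orchestrator's KV cache. It assumes decimal units (1 MB = 1,000 KB) and ignores activation memory, CUDA graph buffers, and allocator overhead, so real capacity will be lower. The function name is illustrative and not part of AitherOS.

```python
def kv_token_capacity(budget_mb: int, weights_mb: int, kb_per_token: int) -> int:
    """Upper bound on cached tokens: (budget - weights) / per-token KV size."""
    if weights_mb > budget_mb:
        raise ValueError("weights exceed the VRAM budget")
    free_kb = (budget_mb - weights_mb) * 1000  # decimal MB -> KB
    return free_kb // kb_per_token


# Figures from the text: 12.8 GB budget, ~4.5 GB of weights.
tq4_tokens = kv_token_capacity(12_800, 4_500, kb_per_token=34)
fp16_tokens = kv_token_capacity(12_800, 4_500, kb_per_token=128)
print(tq4_tokens, fp16_tokens)
```

Whatever the absolute numbers turn out to be in practice, the ratio is the point: the same budget holds between three and four times more context with TQ4 than with FP16.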

vLLM runs with --enable-auto-tool-choice --tool-call-parser hermes so the model handles structured tool calls natively. CUDA graphs are enabled via piecewise compilation (cudagraph_mode: piecewise), which gives us the graph-execution speedup without the rigid batch-size constraints of full CUDA graph capture.

This model never sleeps. It is always resident. Everything else works around it. It also handles effort 1--2 "reflex" tasks directly -- simple greetings, lookups, quick answers -- so there is no separate reflex model. One model, all chat tiers.

2. DeepSeek-R1-distill-Qwen-14B (vLLM TQ, port 8176)

The reasoning model. Handles effort 7--10 tasks: multi-step code refactoring, deep analysis, chain-of-thought problems. It is casperhansen/deepseek-r1-distill-qwen-14b-awq -- a 14B distillation of DeepSeek-R1 into the Qwen architecture, AWQ 4-bit quantized.

This runs locally on the same GPU, in a second vLLM instance (aither-vllm-deepseek), also with TQ4 KV cache. At 50% GPU utilization, it gets ~16 GB. The weights are ~7.5 GB in AWQ 4-bit, leaving the rest for KV cache.

The critical feature: sleep mode. The container runs with VLLM_SERVER_DEV_MODE=1, which exposes a POST /sleep endpoint. When MicroScheduler needs VRAM for image generation, it puts the reasoning model to sleep. vLLM offloads the model weights to DDR5 system RAM (~128 GB available). The VRAM is freed. When reasoning is needed again, weights reload from DDR5 in 5--10 seconds -- dramatically faster than the 30--60 second cold start from disk.
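
A minimal sketch of how a scheduler might drive this, assuming the dev-mode endpoints behave as in upstream vLLM. Only POST /sleep is confirmed by the setup described above; the /wake_up path and the level query parameter are taken from upstream vLLM and may differ in the TurboQuant fork. The ReasoningSlot class and the injected transport are hypothetical, added so the logic can run without a live container.

```python
from typing import Callable

# Transport signature: (method, url) -> HTTP status code. Injected so the
# sketch can be exercised without a running vLLM container.
Transport = Callable[[str, str], int]


def http_transport(method: str, url: str) -> int:
    """Real transport using only the standard library."""
    import urllib.request

    body = b"" if method == "POST" else None
    req = urllib.request.Request(url, data=body, method=method)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.status


class ReasoningSlot:
    """Hypothetical wrapper around the reasoning container's sleep endpoints."""

    def __init__(self, base_url: str, transport: Transport = http_transport):
        self.base_url = base_url.rstrip("/")
        self.transport = transport
        self.asleep = False

    def sleep(self) -> None:
        # level=1 (assumed from upstream vLLM): offload weights to CPU RAM.
        if self.asleep:
            return
        status = self.transport("POST", f"{self.base_url}/sleep?level=1")
        if status != 200:
            raise RuntimeError(f"sleep failed with HTTP {status}")
        self.asleep = True

    def wake(self) -> None:
        if not self.asleep:
            return
        status = self.transport("POST", f"{self.base_url}/wake_up")
        if status != 200:
            raise RuntimeError(f"wake failed with HTTP {status}")
        self.asleep = False
```

Tracking the asleep flag locally keeps repeated sleep requests idempotent, which matters when several image jobs arrive back to back.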

3. Gemma 4 26B-A4B (hot-swap alternative, port 8176)

The wildcard. Google's Gemma 4 is a 26B Mixture-of-Experts model with only 4B active parameters per token (cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit). It does both reasoning and vision -- multimodal in a single model. It is the hot-swap replacement for DeepSeek-R1 on the reasoning port.

Swapping is manual today:

docker compose stop aither-vllm-deepseek
docker compose up -d aither-vllm-gemma4

Both containers bind to the same aither-vllm-reasoning network alias on port 8176, so Genesis does not care which one is running. The served model name changes (deepseek-r1:14b vs gemma4-reasoning), but the routing is port-based.
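
When a caller does need to know which model is behind the port, it can ask the server. The sketch below assumes the OpenAI-compatible GET /v1/models response that vLLM serves; the helper names and the "gemma4" prefix check are illustrative assumptions, not AitherOS code.

```python
import json


def served_model_ids(models_json: str) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    payload = json.loads(models_json)
    return [entry["id"] for entry in payload.get("data", [])]


def supports_vision(model_ids: list[str]) -> bool:
    """Assumption: only the Gemma 4 container advertises vision."""
    return any(model_id.startswith("gemma4") for model_id in model_ids)


# Example body, shaped like the response when Gemma 4 holds the port.
body = '{"object": "list", "data": [{"id": "gemma4-reasoning", "object": "model"}]}'
ids = served_model_ids(body)
print(ids, supports_vision(ids))
```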

Gemma 4 gets 55% GPU utilization with fp8_e4m3 KV cache (TQ4 is blocked by heterogeneous attention head sizes -- 256/512 -- that break vLLM's page-size unification). It needs transformers>=5.5.0, which we pip-install at container startup because vLLM 0.19 pins an older version.

When Gemma 4 is active, you get vision and reasoning from one model slot. No separate vision container needed.

4. ComfyUI (port 8188)

The image generation backend. ComfyUI manages its own model loading:

Tier	Model	Steps	VRAM
Lightning	sdxl_lightning_4step	4	~6 GB
Quality	waiIllustriousSDXL_v140	20--30	~12 GB
Ultra	flux1-dev-fp8	30--50	~20 GB

ComfyUI has an internal LRU cache that keeps recently-used models in VRAM. Great for repeated generations. Catastrophic when the orchestrator and reasoning model already occupy 90% of the card.

5. nomic-embed-text-v1.5 (vLLM, port 8209)

A third vLLM instance dedicated to embeddings. nomic-ai/nomic-embed-text-v1.5 in FP16, 768-dimensional vectors, at just 5% GPU utilization (~300 MB VRAM). It serves CodeGraph, semantic search, the memory graph, and 28+ callers across all layers. Runs on stock vLLM (no TQ patch needed) with --max-num-seqs 64 for high-throughput batch embedding.

This replaced the old Ollama /api/embed path -- faster batching, no Ollama dependency, and the 5% GPU allocation is small enough to coexist with everything else.

The Budget: How 32 GB Gets Divided

Here is the actual VRAM allocation from .env and .DEPLOYMENT/compose/docker-compose.aitheros.yml:

Orchestrator (vLLM TQ):    40%  =  12.8 GB  (always resident)
Reasoning (vLLM TQ):       50%  =  16.0 GB  (sleepable to DDR5)
Embeddings (vLLM):          5%  =   1.6 GB  (~300 MB actual)
                           ───────────────
Subtotal:                  95%  =  30.4 GB

System reserved:           4 GB              (CUDA overhead)
ComfyUI:                   0 GB allocated    (steals from sleeping models)

The math does not add up. 95% + 4 GB = 34.4 GB on a 32 GB card. This works because:

1. vLLM GPU utilization is a ceiling, not a reservation. --gpu-memory-utilization 0.40 means "use up to 40%." At idle with no active sequences, actual usage is just the model weights (~4.5 GB for orchestrator, ~7.5 GB for reasoning, ~300 MB for embeddings).
2. Reasoning sleeps. When ComfyUI needs VRAM, reasoning offloads weights to DDR5. Its 16 GB allocation goes to zero.
3. Embeddings are tiny. nomic-embed-text at 5% uses ~300 MB in practice. The 1.6 GB ceiling is never reached.

The steady-state during chat: orchestrator weights (~4.5 GB) + orchestrator KV cache (dynamic) + reasoning weights (~7.5 GB) + reasoning KV cache (dynamic) + embeddings (~300 MB) ≈ 13--20 GB depending on active sequences. Plenty of headroom for Lightning-tier image generation without sleeping anything.
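
The budget arithmetic can be reproduced in a few lines. This is a sketch using the figures quoted above; the variable names are illustrative, and the headroom figure ignores both the 4 GB system reserve and any live KV cache.

```python
CARD_GB = 32.0

# (name, gpu-memory-utilization ceiling, approximate resident weights in GB)
INSTANCES = [
    ("orchestrator", 0.40, 4.5),
    ("reasoning", 0.50, 7.5),
    ("embeddings", 0.05, 0.3),
]


def ceiling_gb(fraction: float) -> float:
    """Convert a utilization fraction into GB on this card."""
    return round(CARD_GB * fraction, 1)


ceilings = {name: ceiling_gb(frac) for name, frac, _ in INSTANCES}
total_ceiling = round(sum(ceilings.values()), 1)
idle_weights = round(sum(weights for _, _, weights in INSTANCES), 1)
idle_headroom = round(CARD_GB - idle_weights, 1)

print(ceilings)
print(total_ceiling, idle_weights, idle_headroom)
```

The gap between the summed ceilings and the idle weights is the slack the scheduler works with.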

MicroScheduler: The VRAM Traffic Cop

MicroScheduler (port 8150) coordinates VRAM across processes. Every image generation request goes through it. It runs a graduated VRAM reclamation sequence based on the image tier:

Quick Mode (Lightning/Turbo)

SDXL Lightning needs ~6 GB. MicroScheduler checks free VRAM. If the orchestrator and reasoning model have low KV cache occupancy, there is enough headroom already. Slot granted immediately. No models evicted. Zero latency tax.

Standard Mode (Quality)

Full Illustrious XL needs ~12 GB. MicroScheduler sends a sleep signal to the reasoning container. vLLM offloads DeepSeek-R1 weights from VRAM to DDR5 RAM. This frees ~7.5 GB of weights plus whatever KV cache was resident. Combined with existing headroom, that is enough for Illustrious alongside the orchestrator.

Full Mode (Ultra/Flux)

Flux needs ~20 GB. MicroScheduler sleeps reasoning and signals the orchestrator to drop its KV cache to minimum. This clears roughly 20+ GB -- enough for Flux with room for ComfyUI overhead. The orchestrator weights stay resident (they are only 4.5 GB) and the embedding model is negligible (~300 MB), so chat and search still work during generation, just without cached context.

The key insight: most image generations are Lightning tier. A user asking "draw me a cat" does not need Flux. Quick mode adds zero latency. Only the heavy tiers pay the model-swap cost. And when they do, the user sees exactly what is happening thanks to the streaming pipeline.
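
The three modes can be sketched as a small decision function. This is an illustration under stated assumptions, not MicroScheduler's actual code: the Plan type, the action names, and the rule that a Lightning request escalates to standard mode when headroom is short are all hypothetical.

```python
from dataclasses import dataclass, field

# Approximate VRAM needs per tier, taken from the ComfyUI table (GB).
TIER_NEED_GB = {"lightning": 6.0, "quality": 12.0, "ultra": 20.0}


@dataclass
class Plan:
    mode: str
    actions: list[str] = field(default_factory=list)


def plan_reclaim(tier: str, free_gb: float) -> Plan:
    """Choose a reclamation mode for an image request."""
    need = TIER_NEED_GB[tier]
    if tier == "lightning" and free_gb >= need:
        return Plan("quick")  # enough headroom, evict nothing
    if tier in ("lightning", "quality"):
        # Assumed escalation: a Lightning job without headroom is treated
        # like a quality job and sleeps the reasoning model.
        return Plan("standard", ["sleep_reasoning"])
    return Plan("full", ["sleep_reasoning", "trim_orchestrator_kv"])
```

For example, plan_reclaim("lightning", free_gb=8.0) returns a quick plan with no actions, while plan_reclaim("ultra", free_gb=8.0) returns the full plan with both reclamation steps.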

Lazy Loading and Hot-Swap

When ComfyUI gets a generation request, it may need to load a diffusion model from disk. First-load times from NVMe:

Model	Cold Load
SDXL Lightning 4-step	~4 seconds
Illustrious XL v1.4	~8 seconds
Flux Dev FP8	~18 seconds

After first load, ComfyUI's LRU cache keeps the model resident. Subsequent generations in the same tier are near-instant. But switching model families (SDXL → Flux) requires an explicit flush:

def _hot_swap_if_needed(self, model_name: str):
    new_family = self._classify_model_family(model_name)
    if self._current_model_family and self._current_model_family != new_family:
        self._post("/free", json={"unload_models": True, "free_memory": True})
    self._current_model_family = new_family

Without the explicit flush, ComfyUI's LRU tries to keep both model families resident simultaneously. On a card where the orchestrator already has 4.5 GB of weights, there is no room for SDXL + Flux. The flush prevents the OOM.

Reasoning model reload is similar but faster. vLLM sleep mode keeps weights in DDR5 RAM (128 GB available). Reload from DDR5 takes 5--10 seconds. Cold start from disk would be 30--60 seconds. The DDR5 cache is the difference between "brief pause" and "feels broken."

The 150-Second Black Hole (And How We Fixed It)

Here is what the user saw before today:

Pipeline
  [done]   Classify intent
  [done]   Preparing generation      0.1s
  [done]   Loading model to GPU      0.1s
  [active] Generating image          154s...

Three events in 100 milliseconds. Then silence for two and a half minutes. The chat engine was emitting fake progress events before calling the Canvas service. The actual work -- VRAM sleep, ComfyUI health check, safety filtering, prompt enhancement, model cold load, diffusion sampling -- happened inside a single blocking HTTP POST. Zero intermediate feedback.

The Fix: Streaming Progress via NDJSON

We added /generate/stream to the Canvas service. It returns NDJSON with a progress event at every phase. This is real output from a Lightning generation:

{"phase":"health_check",   "elapsed_s":0.0, "detail":"Checking ComfyUI status"}
{"phase":"vram_acquire",   "elapsed_s":0.0, "detail":"Requesting VRAM slot (mode=quick)"}
{"phase":"vram_acquire",   "elapsed_s":0.1, "detail":"VRAM acquired: 1127MB free (mode=quick)"}
{"phase":"safety_filter",  "elapsed_s":0.1, "detail":"Safety: unrestricted"}
{"phase":"prompt_enhance", "elapsed_s":0.1, "detail":"Skipped (fast tier)"}
{"phase":"model_swap",     "elapsed_s":0.1, "detail":"Loading sdxl"}
{"phase":"workflow_build",  "elapsed_s":0.1, "detail":"Building lightning workflow (1024x1024)"}
{"phase":"queued",         "elapsed_s":0.1, "detail":"Submitted to ComfyUI"}
{"phase":"sampling",       "elapsed_s":1.8, "step":1, "total":4, "percent":25}
{"phase":"sampling",       "elapsed_s":2.0, "step":3, "total":4, "percent":75}
{"phase":"sampling",       "elapsed_s":2.1, "step":4, "total":4, "percent":100}
{"phase":"encoding",       "elapsed_s":3.1, "detail":"Encoding images to base64"}
{"phase":"done",           "elapsed_s":3.1, "detail":"1 image(s) generated"}

The Canvas service wraps its entire pipeline in an async generator. Each phase yields a JSON line before starting the next operation. The CanvasClient streams these and forwards each to the chat engine's SSE callback, which pushes it to the frontend in real time.
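
A stripped-down sketch of that pattern, with placeholder awaits standing in for the real phases. The function names and the reduced phase list are assumptions for illustration; the Canvas service's actual generator does more work between yields.

```python
import asyncio
import json
import time
from typing import AsyncIterator


def emit(phase: str, started: float, **fields) -> str:
    """Serialize one progress event as a single NDJSON line."""
    elapsed = round(time.monotonic() - started, 1)
    return json.dumps({"phase": phase, "elapsed_s": elapsed, **fields}) + "\n"


async def generate_stream(prompt: str) -> AsyncIterator[str]:
    """Yield a line before each phase starts; the sleeps stand in for real work."""
    started = time.monotonic()
    yield emit("health_check", started, detail="Checking ComfyUI status")
    await asyncio.sleep(0)  # placeholder for the health check
    yield emit("vram_acquire", started, detail="Requesting VRAM slot (mode=quick)")
    await asyncio.sleep(0)  # placeholder for the MicroScheduler call
    yield emit("queued", started, detail="Submitted to ComfyUI")
    await asyncio.sleep(0)  # placeholder for sampling
    yield emit("done", started, detail="1 image(s) generated")


async def collect(prompt: str) -> list[dict]:
    """Consume the stream the way a client would, one JSON object per line."""
    return [json.loads(line) async for line in generate_stream(prompt)]


events = asyncio.run(collect("draw me a cat"))
print([event["phase"] for event in events])
```

Because each line is a complete JSON document, a consumer can parse and forward events as they arrive instead of waiting for the response to finish.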

ComfyUI sends progress messages over WebSocket with {value, max} for each sampling step. We bridge sync WebSocket to async streaming via a thread + queue:

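import queue

# Thread: WebSocket → queue
if msg_type == 'progress':
    progress_q.put({"phase": "sampling", "step": val, "total": mx,
                    "percent": round(val / mx * 100)})

# Async generator: queue → NDJSON
last = 0  # last percentage forwarded to the client
while not fut.done():
    try:
        msg = progress_q.get(timeout=0.5)
    except queue.Empty:
        continue  # nothing new yet; re-check whether the job has finished
    pct = msg.get("percent", 0)
    if msg["phase"] == "sampling" and pct - last >= 10:
        last = pct
        yield emit("sampling", **msg)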

Sampling events are throttled to one per 10% of progress -- a 50-step generation does not need 50 SSE events.
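
For readers who want to try the bridge in isolation, here is a self-contained sketch. A plain thread stands in for the ComfyUI WebSocket listener, and the blocking queue read runs through asyncio.to_thread so the event loop is never stalled. The function names are hypothetical, and emitting the final 100% event unconditionally is an assumption of this sketch.

```python
import asyncio
import queue
import threading


def fake_comfy_ws(progress_q: queue.Queue, total: int) -> None:
    """Stand-in for the WebSocket thread: one message per sampling step."""
    for step in range(1, total + 1):
        progress_q.put({"step": step, "total": total,
                        "percent": round(step / total * 100)})


async def sampling_events(total: int, min_delta: int = 10):
    """Drain the queue and yield only events at least min_delta percent apart."""
    progress_q: queue.Queue = queue.Queue()
    worker = threading.Thread(target=fake_comfy_ws, args=(progress_q, total))
    worker.start()
    last = 0
    while worker.is_alive() or not progress_q.empty():
        try:
            # Blocking get runs in a helper thread so the event loop stays free.
            msg = await asyncio.to_thread(progress_q.get, True, 0.5)
        except queue.Empty:
            continue
        if msg["percent"] - last >= min_delta or msg["percent"] == 100:
            last = msg["percent"]
            yield msg
    worker.join()


async def main(total: int) -> list[int]:
    return [msg["percent"] async for msg in sampling_events(total)]


percents = asyncio.run(main(50))
print(percents)
```

With 50 steps the producer sends 50 messages and the consumer forwards ten, one per 10% of progress.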

For a quality-tier cold-start generation, the user now sees:

[0.0s]  Checking ComfyUI status
[0.1s]  Requesting VRAM slot (mode=standard)
[3.2s]  Sleeping reasoning model...
[8.1s]  VRAM acquired: 18432MB free
[8.2s]  Checking content safety
[8.3s]  Enhancing prompt with LLM
[12.1s] Building quality workflow (1024x1024)
[12.2s] Submitted to ComfyUI
[20.5s] Sampling 10%
[22.3s] Sampling 50%
[24.1s] Sampling 90%
[25.8s] Encoding images
[26.0s] 1 image generated

Every second accounted for. No black holes.
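
On the consuming side, turning events into timeline rows like the ones above takes very little code. This is a sketch of one possible renderer, not the frontend's actual implementation; the render function and the sample lines are illustrative.

```python
import json


def render(line: str) -> str:
    """Turn one NDJSON progress event into a timeline row."""
    event = json.loads(line)
    if event["phase"] == "sampling":
        text = f"Sampling {event['percent']}%"
    else:
        # Fall back to the phase name when an event carries no detail text.
        text = event.get("detail", event["phase"])
    return f"[{event['elapsed_s']:.1f}s] {text}"


stream = [
    '{"phase":"health_check","elapsed_s":0.0,"detail":"Checking ComfyUI status"}',
    '{"phase":"sampling","elapsed_s":1.8,"step":1,"total":4,"percent":25}',
    '{"phase":"done","elapsed_s":3.1,"detail":"1 image(s) generated"}',
]
rows = [render(line) for line in stream]
print("\n".join(rows))
```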

The Effort Router

Not every request needs the same model. The EffortScaler in AgentForge classifies each request before it touches the GPU:

Effort	Model	Where	Use Case
1--6	Nemotron-Orchestrator-8B	vLLM TQ :8199	Chat, reflex, agents, tool calling
7--10	DeepSeek-R1-14B or Gemma 4	vLLM TQ :8176	Reasoning, complex analysis
Vision	Gemma 4 26B-A4B	vLLM :8176	Image understanding (multimodal)
Embed	nomic-embed-text-v1.5	vLLM :8209	Semantic search, CodeGraph, RAG
Image	SDXL / Illustrious / Flux	ComfyUI :8188	Image generation

The orchestrator handles 80%+ of all calls -- from simple greetings (effort 1) through full agent dispatch with tool calling (effort 6). There is no separate reflex model; the orchestrator is fast enough at AWQ 4-bit with TQ4 KV to handle everything below effort 7. When a task needs reasoning, the orchestrator invokes the deep_reasoning tool, which routes to the reasoning model on port 8176. The orchestrator stays in the loop as dispatcher; the reasoning model does the heavy thinking.
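
Reduced to its core, the text-model half of the table is a threshold check. The sketch below is a simplification under assumptions: the Target type and route function are hypothetical, and the real EffortScaler also has to produce the effort score in the first place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    model: str
    port: int


ORCHESTRATOR = Target("Nemotron-Orchestrator-8B", 8199)
REASONING = Target("DeepSeek-R1-14B or Gemma 4", 8176)


def route(effort: int) -> Target:
    """Map an effort score (1-10) to the model slot from the table."""
    if not 1 <= effort <= 10:
        raise ValueError(f"effort must be 1-10, got {effort}")
    return ORCHESTRATOR if effort <= 6 else REASONING
```

For example, route(2) and route(6) both land on port 8199, while route(7) goes to whichever model currently holds port 8176.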

Gemma 4 is the interesting one. As a multimodal MoE model, it handles both vision and reasoning in a single model slot. When it is active on port 8176 (instead of DeepSeek-R1), image understanding comes for free -- no separate vision container, no VRAM zone swap. The trade-off is that DeepSeek-R1 is a stronger pure-reasoning model, so the choice depends on workload. Heavy code analysis? DeepSeek. Mixed vision + reasoning? Gemma 4.

The Numbers

On the RTX 5090 with all capabilities available:

Metric	Value
Orchestrator (Nemotron-8B AWQ 4-bit)	~42 tok/s (TQ4 KV)
Orchestrator weights	~4.5 GB
Reasoning (DeepSeek-R1-14B AWQ 4-bit)	~30 tok/s (TQ4 KV)
Reasoning weights	~7.5 GB
Reasoning sleep → DDR5 offload	~2 seconds
Reasoning wake from DDR5	5--10 seconds
Lightning image gen (cached model)	2--4 seconds
Quality image gen (cold model load)	15--30 seconds
Ultra image gen (Flux + VRAM sleep)	45--120 seconds
VRAM swap latency	0s quick / 5s standard / 10--15s full

The critical insight: most requests never trigger a VRAM swap. Chat stays on the always-resident orchestrator. Reasoning runs on its own 50% allocation. Lightning images fit in the headroom between actual usage and allocation ceiling. Only quality/ultra generation and model-family switches require actual juggling. The graduated preemption means the common case is free and the expensive case is rare.

What We Learned

1. AWQ 4-bit + TQ4 KV cache is the enabler. Without 4-bit quantization, an 8B model takes ~16 GB in FP16. A 14B model takes ~28 GB. You cannot run both on 32 GB, let alone leave room for image generation. AWQ 4-bit cuts weights to ~4.5 GB and ~7.5 GB respectively. TurboQuant cuts KV cache by 4x. The compression is what makes the entire juggling act possible.

2. Sleep mode is the secret weapon. Binary eviction (load/unload from disk) is too slow. DDR5 offload via vLLM sleep mode takes 2 seconds down and 5--10 seconds up. With 128 GB of system RAM, we can keep multiple models "warm" in DDR5 even when they are not in VRAM. It turns a 60-second cold start into a 5-second warm resume.

3. Observability is not optional for long operations. A 3-second Lightning generation does not need a progress bar. A 120-second Flux generation absolutely does. The streaming pipeline was the difference between "the system is broken" and "the system is sleeping the reasoning model, currently at 60% sampling."

4. Graduated eviction beats binary eviction. Our first implementation had two modes: everything loaded, or everything evicted. Every image generation, even Lightning, triggered a full VRAM flush. A 2-second generation had a 15-second eviction/reload tax. The three-tier approach (quick/standard/full) eliminated the tax for 80% of requests.

5. MoE models change the game. Gemma 4's 26B total / 4B active architecture means you get a "26B-class" model at roughly 4B-class compute per token. All 26B parameters still have to be resident in VRAM, which is part of why it asks for 55% of the card, but because it is multimodal, one model slot serves both reasoning and vision. The per-model-slot capability density is going up fast. The juggling act gets easier when each model does more.
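
The weight figures in lesson 1 follow from simple arithmetic: parameters times bits per parameter, divided by eight. The sketch below uses decimal gigabytes and counts raw weights only; the gap between its 4-bit results and the ~4.5 GB and ~7.5 GB quoted above is presumably quantization metadata and layers kept at higher precision.

```python
def weight_gb(params_billion: float, bits: int) -> float:
    """Raw weight size in decimal GB: parameters x bits / 8."""
    return params_billion * bits / 8


for name, params in [("Nemotron 8B", 8), ("DeepSeek-R1 14B", 14)]:
    print(name, weight_gb(params, 16), weight_gb(params, 4))
```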

Architecture

User Request
    │
    ▼
┌──────────────────────────┐
│       Chat Engine         │  Classifies intent + effort
│      (Genesis:8001)       │
└──────────┬───────────────┘
           │
     ┌─────┴──────────────────────────────────────┐
     │              │              │               │
     ▼              ▼              ▼               ▼
┌──────────┐ ┌────────────┐ ┌──────────┐  ┌──────────────┐
│ E 1-2    │ │ E 3-6      │ │ E 7-10   │  │ Image Gen    │
│ reflex   │ │ Nemotron   │ │ DeepSeek │  │ Canvas:8108  │
│ Nemotron │ │ Orch 8B    │ │ R1 14B   │  └──────┬───────┘
│ (shared) │ │ AWQ 4-bit  │ │ AWQ 4-bit│         │
└──────────┘ │ TQ4 KV     │ │ TQ4 KV   │  ┌──────┴───────┐
             │ vLLM:8199  │ │ vLLM:8176│  │MicroScheduler│
             │ 40% GPU    │ │ 50% GPU  │  │    :8150     │
             └────────────┘ │ (sleeps  │  │  VRAM slot   │
                            │ to DDR5) │  │ quick/std/ful│
                  ┌─ OR ──▶ └──────────┘  └──────┬───────┘
                  │                               │
           ┌──────────┐                   ┌───────┴──────┐
           │ Gemma 4  │                   │ ComfyUI:8188 │
           │ 26B MoE  │                   │ SDXL / Flux  │
           │ AWQ 4-bit│                   │ WS progress  │
           │ Vision + │                   └──────────────┘
           │ Reasoning│
           │ vLLM:8176│
           │ 55% GPU  │
           └──────────┘
                           ┌──────────────────────┐
                           │    RTX 5090 (32 GB)   │
                           │ ┌──────┐  ┌─────────┐│
                           │ │ Orch │  │Reasoning││
                           │ │ 40%  │  │  50%    ││
                           │ │always│  │sleepable││
                           │ └──────┘  └─────────┘│
                           │   ComfyUI steals from │
                           │   sleeping models     │
                           └──────────────────────┘

The entire system runs on a single workstation: AMD Ryzen 9 9950X3D (16 cores), 128 GB DDR5, RTX 5090 (32 GB). No cloud dependency. Three vLLM instances share the GPU with sleep-mode coordination. ComfyUI gets VRAM by putting models to bed. Every request gets the right model, at the right time, with the right amount of VRAM -- and now, with real-time visibility into every phase of the process.

The code is open. The streaming pipeline shipped today.
