Most "local AI" write-ups run one model. AitherOS runs a fleet — an always-on orchestrator LLM, an on-demand vision model, a text-embedding model, image generation (ComfyUI/SANA), speech-to-text and TTS, and a 26B reasoning challenger — and it does it on a single NVIDIA RTX 5090 with 32 GB of VRAM, with a 128 GB DDR5 system-RAM backstop and an optional NVIDIA DGX Spark on the LAN.

You cannot fit all of that in 32 GB at once. The trick isn't a bigger GPU — it's coordination: knowing what must stay resident, what can be loaded on demand, what can spill to system RAM, and what must never be disturbed. That coordinator is our MicroScheduler. This post documents the exact hardware, the exact model settings (every flag), and why each piece is configured the way it is.

The hardware

Component	Spec	Role
GPU	NVIDIA GeForce RTX 5090 — 32 GB GDDR7, Blackwell (`sm_120`), driver 610 / CUDA 13 UMD	Everything local
System RAM	128 GB DDR5	Offload target — weights spill here instead of OOMing
DGX Spark	`spark.local` (192.168.1.112), 128 GB unified memory	Optional heavy-lift node on the LAN

The 5090 being Blackwell matters: sm_120 is new enough that not every prebuilt CUDA image supports it. We verified the llama.cpp CUDA image detects it before committing to it:

Available devices:
  CUDA0: NVIDIA GeForce RTX 5090 (32606 MiB, 7309 MiB free)

That "7309 MiB free" is the whole story of this post: with the orchestrator and friends loaded, you typically have ~7 GB of slack, not 32. Every design decision below flows from that number.

The model lineup

Model	Engine	Footprint	Residency
Nemotron-Orchestrator-8B (AWQ 4-bit)	vLLM (TurboQuant fork)	~13 GB (weights ~6.4 GB + KV)	Always resident — the brain
Qwen2.5-VL-7B (GGUF Q4_K_M)	llama.cpp	~5 GB (mmproj on CPU)	On-demand — loads to caption, then frees
nomic-embed-text	vLLM	~1–2 GB	Always resident (tiny)
gemma-4-26B-A4B (AWQ)	vLLM	~16 GB	Swap-slot challenger (rarely co-resident)
ComfyUI / SANA (image gen)	ComfyUI	variable	On-demand, offloads to DDR5 between renders
Whisper large-v3 / XTTS	faster-whisper / Coqui	a few GB	Perception compound

The orchestrator is the only thing that's always on the GPU. Everything else either is tiny, loads on demand, or spills to system RAM.

The orchestrator: vLLM, every flag explained

This is aither-vllm-tq — an 8-billion-parameter Nemotron, AWQ-quantized to 4-bit, served through our TurboQuant-patched vLLM fork. Here is the launch command with every argument annotated:

python3 -m vllm.entrypoints.openai.api_server \
  --model cyankiwi/Nemotron-Orchestrator-8B-AWQ-4bit \
  --kv-cache-dtype tq-t4nc \
  --gpu-memory-utilization 0.60 \
  --max-model-len 32768 \
  --dtype auto \
  --served-model-name aither-orchestrator \
  --max-num-seqs 16 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --compilation-config '{"cudagraph_mode":"piecewise","max_cudagraph_capture_size":16}'

--model …AWQ-4bit — the weights. AWQ (Activation-aware Weight Quantization) packs the 8B model into ~6.4 GB at 4 bits, with negligible quality loss versus fp16.
--kv-cache-dtype tq-t4nc — TurboQuant KV-cache dtype (our fork). The KV cache (the per-token attention memory) is the part of VRAM that grows with context length; compressing it lets us serve a 32K context in a fraction of the usual KV footprint.
--gpu-memory-utilization 0.60 — the single most important number. vLLM pre-reserves this fraction of the GPU (≈19 GB of 32) at startup for weights + KV + CUDA graphs. We learned the hard way that you cannot lower this on a busy box — trimming it to 0.27 left too little headroom and the CUDA-graph capture step crashed the engine. 0.60 gives it ~12 GB of KV — comfortable.
--max-model-len 32768 — maximum context window (tokens). Bigger = more KV memory reserved.
--dtype auto — let vLLM pick the compute dtype (fp16/bf16) appropriate for the weights.
--served-model-name aither-orchestrator — the name clients request; decouples the API name from the HF repo.
--max-num-seqs 16 — max concurrent sequences (batch width). Also bounds the CUDA-graph capture budget. 16 is right for a single-user/dev box; hyperscalers crank it to 256 (which costs ~15 GB of graphs).
--enable-auto-tool-choice + --tool-call-parser hermes — native function/tool calling, parsed in the Hermes format the agents emit.
--enable-chunked-prefill — overlaps prompt processing with decode batches, improving time-to-first-token.
--enable-prefix-caching — reuses KV for shared prompt prefixes (system prompts, few-shot) across requests — big latency win for agents that share a preamble.
--compilation-config {cudagraph_mode: piecewise, …} — torch.compile + piecewise CUDA graphs. CUDA graphs capture the kernel launch sequence once and replay it, removing per-step launch overhead. "Piecewise" captures sub-graphs so it works with chunked prefill. (The alternative, --enforce-eager, disables graphs and was historically our #1 performance killer — avoid it.)

Why an 8B orchestrator and not something bigger? Because it's the router, not the soloist. It classifies intent, calls tools, and dispatches to specialists. It needs to be fast and always-available, which means small and resident.

The vision model: llama.cpp, and why not vLLM

Frame/image captioning (for the media-ingestion pipeline — turning uploaded videos into searchable knowledge) needs a vision-language model. We picked Qwen2.5-VL-7B, the strongest open VLM at that size for captioning and OCR. But we run it in llama.cpp, not vLLM. Here's the launch line, annotated:

llama-server \
  -hf ggml-org/Qwen2.5-VL-7B-Instruct-GGUF:Q4_K_M \
  --host 0.0.0.0 --port 8307 \
  -ngl 99 \
  --no-mmproj-offload \
  --ctx-size 4096 \
  --alias qwen2.5-vl \
  --jinja

-hf ggml-org/Qwen2.5-VL-7B-Instruct-GGUF:Q4_K_M — pull the model from Hugging Face in GGUF format (llama.cpp's container format), Q4_K_M quant (~4.7 GB, the quality/size sweet spot). llama.cpp auto-fetches the matching mmproj (the multimodal projector / vision encoder) for VLM repos.
--host 0.0.0.0 --port 8307 — serve the OpenAI-compatible API on the internal network.
-ngl 99 — "number of GPU layers" = 99 means all transformer layers on the GPU. (99 is just "more than the model has.")
--no-mmproj-offload — keep the vision projector (the CLIP-like encoder, ~3.9 GB) on the CPU. This is the flag that makes it fit: text layers on GPU (~5 GB), image encoder on CPU/DDR5. Image encoding is slightly slower, but the resident GPU footprint drops from ~9 GB to ~5 GB, which reliably fits in the ~7 GB of slack without ever pressuring the orchestrator.
--ctx-size 4096 — context window. Captions are short, so a small context keeps KV tiny.
--alias qwen2.5-vl — the served model name clients request.
--jinja — use the model's Jinja chat template (required for Qwen2.5-VL's multimodal chat format with image_url content parts).

Why llama.cpp instead of vLLM for this one? Three reasons:

Partial CPU offload is first-class. vLLM wants its whole model resident in a reserved VRAM block. llama.cpp lets us put the heavy vision projector on the CPU with a single flag and keep only the text layers on the GPU — exactly the "fit in the slack" behavior we need.
It's a bursty workload. Captioning only happens during media ingestion. A persistent vLLM instance would sit on VRAM 24/7, competing with the orchestrator. llama.cpp in a container we start on demand and stop after idle holds VRAM only during the burst.
Quality is identical. ollama is literally llama.cpp under the hood; we use raw llama.cpp so the lifecycle (start/stop) is ours to control via the MicroScheduler, rather than ollama's keep_alive heuristic.

The captioning works end to end — fed a video frame, Qwen2.5-VL returns: "a classic test pattern used for calibrating television screens, featuring a grid of colored bars."

AWQ vs GGUF, briefly

Two quantization worlds show up above:

AWQ (vLLM): GPU-first, great throughput, but the model lives in a reserved VRAM block — all-or-nothing residency.
GGUF (llama.cpp): designed for flexible placement — you choose how many layers go on GPU vs CPU. Perfect when VRAM is the scarce resource and you'd rather be a bit slower than not run at all.

Same model, different trade-offs. We use AWQ for the always-resident orchestrator (throughput matters) and GGUF for the on-demand vision model (placement flexibility matters).

The DGX Spark

spark.local (192.168.1.112) is an NVIDIA DGX Spark with 128 GB of unified memory — a separate box on the LAN, not the workstation. When it's up, it hosts the heavy models that don't belong on a 32 GB consumer card: a 27B-class reasoning vLLM, embeddings, TTS/STT, ComfyUI, and LTX video.

The workstation reaches it through thin proxy containers (aither-vllm-dgx, -dgx-swap, -dgx-embed, -dgx-orch, -dgx-reflex) that forward to spark.local:8120 and friends. The MicroScheduler treats the DGX as just another set of backends in its routing table — when the Spark is healthy, heavy reasoning routes there and the 5090 is freed for interactive work; when it's down (as it was while writing this), everything gracefully falls back to local + cloud.

That fallback is the whole point of the next section.

Why a "MicroScheduler" if everything is local?

The instinct is: if I'm not in the cloud and there's one GPU, what is there to schedule? The answer: the GPU is a shared resource with more claimants than capacity, and they arrive at unpredictable times.

On this one 5090 you have, all wanting VRAM:

the orchestrator (always),
the vision model (during media ingest),
image generation (when a user renders),
a 26B reasoning challenger (for hard problems),
embeddings (constantly, for memory/search),
speech models (during voice).

Naïvely co-loading them = instant OOM and, worse, a driver crash that takes down the whole machine (we hit a DPC-watchdog reboot during testing). Something has to arbitrate. That something is the MicroScheduler (:8150), and every LLM call in AitherOS goes through it — never directly to a backend.

What it actually does:

Backend registry + health. It knows every backend (vllm, vllm_swap, llama_vision, the DGX pool, cloud) and runs a background health probe so routing decisions are instant (O(1) snapshot, no per-request network I/O).
Tier-based routing. Requests carry an effort/tier; the scheduler maps tier → model → backend, with fallback chains (local → DGX → cloud).
VRAM coexistence via lifecycle control. It uses the Docker Engine API to start/stop model containers on demand, and the vLLM sleep/wake API (/sleep offloads weights to DDR5, /wake_up reloads them in ~5–10 s) to multiplex models that can't be co-resident.
On-demand vision (new). A vision request hits /swap/request-vision; the scheduler starts the llama.cpp Qwen2.5-VL container (evicting only non-essential VRAM first — never the orchestrator), waits for health, and after an idle timeout stops it to free the VRAM. This is the ComfyUI coexistence pattern (load → use → release) applied to an LLM.
Cloud bridge during transitions. While a local model is loading (which can take seconds), requests are transparently streamed from a cloud model so the user never waits on a cold start.
Hard guardrails. The orchestrator is sacred: the scheduler will evict Ollama, free ComfyUI, or fall back to cloud — but it will never sleep or shrink the orchestrator to make room, because that crashes the brain everything else depends on.

The unifying idea — and the reason 32 GB feels like more — is tiered residency:

Resident: the small orchestrator + embeddings (~15 GB).
On-demand, GPU: vision and image models load into the ~7 GB of slack and release on idle.
Spill to DDR5: ComfyUI's --normalvram --reserve-vram offloads models to the 128 GB of system RAM between renders; llama.cpp's --no-mmproj-offload keeps the vision encoder in RAM.
Off-box: heavy reasoning goes to the DGX Spark when present, or cloud when not.

ComfyUI is the template the rest copies: --normalvram offloads models to DDR5 between generations, --reserve-vram 9 leaves 9 GB free for the LLM containers, and --cache-lru 16 keeps recent models warm in system RAM for fast reload. Load, use, release — repeat.

Takeaways

Pick quantization by residency, not just size. AWQ for the always-on model (throughput); GGUF for the on-demand model (placement flexibility).
--gpu-memory-utilization is a floor, not a knob to shave. Below the CUDA-graph + KV minimum, vLLM crashes at startup. Leave the orchestrator alone.
CPU offload is a feature, not a defeat. --no-mmproj-offload (llama.cpp) and --normalvram (ComfyUI) trade a little latency for the ability to coexist at all. With 128 GB of DDR5, that's a great trade.
One GPU still needs a scheduler. Not to share across users — to share across time, among models that each want more VRAM than is free, without ever knocking over the one model everything depends on.

The result: a single 5090 behaves like a much bigger machine, because at any instant it's only holding what's actually being used — and a MicroScheduler makes sure that "what's being used" never includes two things that don't fit.

Enjoyed this post?

All posts Try AitherOS

Back to blog

infrastructurevllmllama.cppgpuvrammicroschedulerdgx-sparkvisionqwen2.5-vl

Running a Whole AI Stack on One RTX 5090: Models, Flags, and the MicroScheduler

May 31, 20265 min readAitherOS Engineering