Most "AI infrastructure" diagrams are a single box labeled the model, an arrow to a GPU, and an arrow to a bill. Ours doesn't fit in a box. A single user turn can be answered by an 8-billion-parameter model on a desktop GPU in a few hundred milliseconds, or it can fan out to a 27B reasoner on a DGX Spark, an image model, a rented cloud GPU that didn't exist ten minutes ago, and a swarm of agents coordinating through a shared graph — depending entirely on how hard the question is.

The design principle underneath all of it: local-first, escalate on demand. Cheap work stays on hardware we own and never leaves the building. Expensive work climbs a ladder — bigger local model, then the reasoning box, then the cloud — and only as far as it has to. This post walks that ladder from the bottom, following one request up through the stack, then widens out to the deployment fabric that lets it scale past a single desk.

Where something is still in progress, this post says so — there's a status board at the end. No stage is described as more finished than it is.

1. The always-on core: an 8B model on an RTX 5090

Every user turn lands in the same place first: aither-vllm-tq on port 8199, an RTX 5090 running Nemotron-Orchestrator-8B-AWQ. Not a frontier model. An 8B, deliberately, quantized down to about 11.5 GB of VRAM. This is the orchestrator, and it has one job that most systems hand to a much larger model: decide what happens next.

The reason an 8B is enough here is that the orchestrator doesn't try to be the answer to everything. At low effort it responds directly. At high effort it routes — it reaches for a reason tool that hands the hard part to the reasoning brain (§5). Keeping the always-on model small means it's always warm, always cheap, and never the bottleneck. We call this the orchestrator-primary doctrine: the small model is the front door, not the whole house.

Two more things live on the 5090 alongside it: a nomic-embed-text-v1.5 embeddings server on 8209, and a hot-swap slot on 8176 that loads a heavier reasoning/vision model (DeepSeek-R1-14B or Gemma-4-12B, ~18 GB) on demand when the reasoning box is offline. One consumer desktop card, three roles, because none of them needs the card at the same moment.

2. Quantization: how the big models fit at all

You cannot run a 27B model at 131K context on hardware that costs less than a car without being serious about quantization. We run two different schemes because the two machines are different.

On the 5090, the orchestrator uses AWQ (activation-aware weight quantization) at a TQ4 profile — 4-bit weights that preserve the activations that matter — plus TurboQuant on the KV cache (next section). That's what gets an 8B model and its working memory into ~11.5 GB with room to spare.

On the DGX Spark, the reasoner runs NVFP4 — NVIDIA's native 4-bit floating-point format — via the vrfai/Qwen3.6-27B-NVFP4 checkpoint, with the FP4 GEMM kernels going through flashinfer-cutlass on Blackwell (SM 12.1a). But the interesting part isn't the weight format, it's what sits on top of it:

DFlash speculative decoding, k=15. A small draft model proposes 15 tokens ahead; the big model verifies them in a single forward pass and keeps the ones it agrees with. When the draft is good, you get many tokens for the price of one verification step. In production this lands the 27B reasoner at 19–24 tokens/second — usable interactive speed from a model that size, on a box that fits under a desk.

# DGX reasoner (docker-compose.dgx-spark.yml)
VLLM_DGX_NVFP4_MODEL: vrfai/Qwen3.6-27B-NVFP4
VLLM_NVFP4_GEMM_BACKEND: flashinfer-cutlass
NUM_SPECULATIVE_TOKENS: 15          # DFlash draft depth
MAX_MODEL_LEN: 131072               # 131K context
GPU_MEMORY_UTILIZATION: 0.55        # of 128 GB unified

3. AitherKVCache: compression first, tiers when you need them

The KV cache — the attention state for every token in the context window — is usually the thing that actually runs you out of memory, not the weights. At FP16 it's 16 bits per value, per token, per layer. It adds up fast, and it's what caps how much context and how many concurrent requests you can hold.

TurboQuant is our answer: per-channel FP4/INT4 quantization of the K and V tensors, read directly by fused attention kernels without a decompress step. The compression ratio is 6–8×. Concretely, that turns roughly 66K tokens of FP16 cache into ~224K tokens in the same 8.3 GB of VRAM. That headroom is the whole game — it's what buys long context and batched concurrency on a fixed card.

A note on honesty, because we've written about this before. We built a 3-tier hierarchy on top of TurboQuant — VRAM, DDR5, and recompute — so cache can spill to system RAM and beyond (the full story is here). In today's production config, CPU offload is off by default (VLLM_TQ_CPU_OFFLOAD_GB=0): compression alone gives enough headroom that we don't need to pay the offload latency. So the accurate framing is — compression is the always-on win; the tiering is built and available when a workload needs it. Calling the whole thing a "tiered cache" would oversell what's actually running most days.

4. The reasoning brain: a DGX Spark that only wakes for hard questions

The DGX Spark GB10 is the second machine — ARM64, 128 GB of unified memory, running the Qwen3.6-27B-NVFP4 reasoner on port 8120 with the DFlash setup from §2. 131K context, up to 10 concurrent sequences, GPU utilization pinned at 0.55 so weights, KV cache, and the draft model all fit in the unified pool.

It is not in the default path. The orchestrator on the 5090 handles the conversation; when a turn is genuinely hard — architecture, root-cause, multi-step reasoning — the orchestrator invokes the reason tool, and that is what routes to the DGX. The big model is a specialist the small model calls, not a toll every request pays. This is the orchestrator-primary doctrine made physical: two machines, and the expensive one stays asleep until the cheap one decides it's worth waking.

5. The routing spine: MicroScheduler and the effort ladder

Everything above is tied together by one component: MicroScheduler on port 8150. Every LLM call takes the same path —

User turn
  → Genesis (:8001)            request entry
  → LLMGateway                 assembles context, picks intent
  → MicroScheduler (:8150)     queue (16 slots) + backend selection
  → backend                    5090 orchestrator | DGX reasoner | cloud

Backend selection is an effort ladder. Effort 1–6 is answered directly by the orchestrator with no tools offered — no reasoning tax on simple turns. Effort 7+ offers the reason tool, and if the orchestrator pulls it, MicroScheduler routes down a fallback chain:

reason  →  DGX Qwen3.6 (:8120)        primary
        →  local swap DeepSeek-R1     if DGX offline
        →  DeepSeek-V4 (cloud API)    if both unavailable
        →  Claude / GPT-5 / Gemini    last-resort external

MicroScheduler is more than a router. It holds the agent registry (21+ services heartbeating in), gates image-generation slots so the GPU isn't oversubscribed, coordinates VRAM across the fleet, and hot-swaps an agent's "will/spirit/persona" without a restart. It's the single choke point every unit of GPU work passes through — which is exactly why it's the single place we enforce priority and capacity.

6. Beyond text: image generation

Not every request is tokens. Image generation runs through media-forge on port 8200, a unified backend that wraps ComfyUI together with the LoRA/style stack and AitherSafety filtering. The primary model is waiIllustriousSDXL_v140 with a 4-step Lightning variant for speed; a Flux workflow sits behind it as a fallback, and Sana on 8796 — a one-step Linear DiT — handles the "I need an image in ~0.1s" case. Cloud providers (Gemini, OpenAI, BYO-key) backstop everything.

The full operation set is generate / img2img / inpaint / outpaint / upscale (ESRGAN + UltraSharp) / animate / controlnet / 3D-character, with prompt complexity driving an effort routing of its own — simple prompts get fewer steps and a faster model. And critically, image work acquires a slot from MicroScheduler before it runs, so a burst of image jobs can't starve the LLMs sharing the same cards. AitherCanvas on 8108 manages the ComfyUI lifecycle underneath.

7. Beyond text: structured machine learning

The newest modality, shipped and container-validated this week, is structured ML on port 8192 — and it's the one place that deliberately does not route through MicroScheduler, because it isn't an LLM. It's a direct fabric service running two zero-shot foundation models: TabFM 1.6B for tabular classification (≤10 classes) and regression, and TimesFM 2.5 (200M) for time-series forecasting.

"Zero-shot" here means no training step: you hand the model a labeled support set and it predicts in a single forward pass. It lets an agent learn from a new table or a new metric stream immediately, without a GPU training run. The details of the self-teaching loop are their own post — the point for this tour is that the inference fabric isn't only about chat. It serves whatever shape of model the work needs, on the same hardware, behind the same mesh.

8. Scaling out: Vast.ai autoscaling

Two machines have a ceiling. When the local backends saturate and a high-effort reasoning job is waiting, the deploy orchestrator bursts to rented GPUs on Vast.ai through the mesh provider (services/mesh/providers/vast_ai.py) — and how it picks a box is the interesting engineering.

The obvious choice is cheapest-first. For a reasoning node, we deliberately don't: we select fastest-network-first: filter to offers with inet_down > 1000 (1 Gbps+), sort by download bandwidth descending, and dedup by physical machine. The reason is counterintuitive until you've watched a cold start: a rented box's time-to-useful is dominated not by compute but by pulling the multi-gigabyte model image. A cheaper box on a slow uplink loses to a slightly pricier box that's ready in a third of the time. So we optimize for the bottleneck that actually hurts, inside a hard price cap. (The image-generation burst path has its own provisioner that weighs cost more heavily — different workload, different trade-off.)

Getting models onto that box is a three-tier resolution: a private MinIO presigned URL first, then Strata (our curated, RBAC-checked aither://models/... fabric), then public HuggingFace / CivitAI. With a smart-source twist — a rented, external host is told to prefer the public CDN first, because pulling from our MinIO over its uplink would bottleneck; internal consumers prefer private. Downloads run parallel (4 at a time), and models are never baked into the image — the comfyui-cloud container pulls them at boot, joins the mesh over Tailscale with an HMAC pre-shared key, and starts heartbeating every 30 seconds. Idle boxes self-terminate after ~15 minutes; budget hard-stops cap the spend.

One honest gap: cloud nodes can't reach our internal-only scheduler and compute URLs over the tailnet, so their registry visibility currently leans on Strata peer registration plus the Tailscale overlay rather than the clean internal path. It works; it isn't yet the seamless version. It's flagged in the code and on the board below.

9. The swarm: many cheap agents, one synthesis

For work that parallelizes — research, broad code changes, audits — there's a 3-role swarm (adk/swarm.py), and it's worth being precise about what it is, because it's easy to over-describe. It is not a magic "brain router." It's three roles over a shared data plane:

Plan decomposes a goal into ~6 independent subtasks and writes them to a shared Qdrant ledger.
Workers — as many as you launch — each claim an open task (best-effort deconfliction via a claim record with a 600s TTL, so a crashed worker's task gets picked back up), run the model, and dump their findings as 768-dimensional nodes into the graph memory.
Synthesize rehydrates every finding and runs one high-effort reasoning pass over the whole set.

The backend is selectable — SWARM_BACKEND=deepseek|gateway|anthropic|openai. DeepSeek is the workhorse choice for the workers (cheap, strong) with a reasoning model for synthesis; the LLMRouter abstraction is what lets each role target a different provider (and quietly works around DeepSeek's strict chat template by demoting mid-conversation system messages). This is proven at 10 workers locally; fleet-wide go-live is gated on secrets management and Qdrant exposure — so it's on the board as proven, not yet GA.

10. Managed agents: .WORKFORCE

The top of the stack is where agents become deployed products, not just processes. .WORKFORCE is the managed-agent deployment layer, and its current direction is an integration with Anthropic's Managed Agents — agents that run as remote sessions, with a host-side tool relay that bridges our own tools (company_recall, approval_request, RLS-gated Supabase ops) into that remote agent, plus CEO approval gates (webhook + Telegram cards) and optional graph-RAG grounding. Deployments are rows in a managed_agent_deployments table; the migration framework is live and being cut over.

Underneath, the fleet-side orchestration that's been running for a while: AgentForge (identity-aware dispatch, MCTS-chosen multi-agent chains, worktree isolation, snapshots that survive restarts), ExpeditionManager (multi-session projects with dependency graphs and acceptance gates), and AitherWorker on 8159 running eight background loops (scheduler, mail ingest, background jobs, dunning, and more). That's how an "agent" becomes something you deploy, supervise, and bill — rather than a script you babysit.

The whole picture

Follow one turn all the way up and the shape is clear: it enters at Genesis, the 8B orchestrator on the 5090 decides how hard it is, MicroScheduler routes it — direct answer, or a reason call to the DGX 27B, or an image slot, or a structured-ML forward pass — and if the owned hardware is full and the job is worth it, the mesh rents a fast-network GPU on Vast, pulls the model from the nearest source, and folds it into the fleet. Swarms fan the biggest jobs across many cheap workers; .WORKFORCE turns the results into deployed, governed agents. Local-first, escalate on demand, at every single layer.

Here's the honest state of it:

Layer	What it is	Status
5090 orchestrator	Nemotron-8B AWQ/TQ4, every turn's front door	Live
DGX reasoner	Qwen3.6-27B NVFP4 + DFlash k=15, 19–24 tok/s, 131K	Live
AitherKVCache	TurboQuant FP4/INT4 KV compression, 6–8×	Live (tiering built, offload off by default)
MicroScheduler	effort ladder + fallback chain + slot gating	Live
Image generation	media-forge + Sana + SDXL/Flux	Live
Structured ML	TabFM 1.6B / TimesFM 200M, zero-shot	Live (container-validated)
Vast autoscaling	network-first burst + 3-tier model pull	Live
Cloud ↔ internal registry	scheduler/compute reachability from rented boxes	Known gap (Strata/Tailscale workaround)
DeepSeek swarm	3-role, Qdrant-coordinated	Proven at 10 local; fleet go-live gated
.WORKFORCE managed agents	Anthropic Managed Agents + host tool relay	Framework live; migrating

None of this is a bigger model. It's a smaller front door in front of a ladder that only climbs as high as the question demands — and a fabric that can borrow a GPU on the other side of the world when the question is big enough to be worth it.

Enjoyed this post?

All posts Try AitherOS

Back to blog

infrastructureinferencequantizationvllmgpudeploymentarchitecturedeep-dive

One Turn, Two GPUs, and a Cloud: The AitherOS Inference Stack, End to End

July 4, 202617 min readAitherium