
Hybrid AI Deployment: Local Inference, Cloud GPUs, Frontier Models — One System

March 19, 2026 · 14 min read · David Parkhurst


There's a spectrum in AI deployment that nobody talks about honestly. On one end: run everything locally, own your data, pay nothing per token, hit VRAM walls constantly. On the other end: send everything to APIs, pay per token, get unlimited scale, own nothing.

We chose neither. We built a system that uses all three tiers — local GPU, rented cloud GPUs, and frontier model APIs — and routes every request to the right one automatically. Not as a theoretical architecture. As a production system running on a single workstation with a single RTX 5090.

Here's the full picture.


The Hardware

One machine:

  • GPU: RTX 5090 (32 GB VRAM)
  • CPU: AMD Ryzen 9
  • RAM: 128 GB DDR5
  • OS: Windows 11 + WSL2
  • Container runtime: Docker Desktop

That's it. No rack of servers. No Kubernetes cluster. One consumer workstation running an AI operating system with 200+ microservices, 65 Docker containers, and a full multimodal inference stack.


Tier 1: Local Inference — vLLM + Ollama

vLLM: The Workhorse

The core of our local inference is vLLM, running multiple model workers inside Docker containers. In our default dev profile, we run five models simultaneously on one GPU:

| Model | Role | VRAM | Port | What It Does |
|---|---|---|---|---|
| Nemotron-Orchestrator-8B (AWQ 4-bit) | Orchestrator | 35% | 8120 | Every chat message. Always loaded. Never evicted. |
| DeepSeek-R1-14B (AWQ) | Reasoning | 40% | 8176 | Effort 7+ tasks. Deep analysis. Hot-swappable. |
| DeepSeek-Coder-V2-Lite | Coding | 18% | 8200 | Code generation and refactoring |
| Qwen2.5-VL-7B | Vision | 12% | 8201 | Image understanding, screenshot analysis |
| nomic-embed-text-v1.5 | Embeddings | 5% | 8209 | Semantic search, memory graph, RAG |

Total VRAM utilization: ~80%. The remaining 20% is headroom for ComfyUI, Windows desktop compositing, and CUDA context overhead.

Every model runs behind our MicroScheduler (port 8150), which handles priority queuing, VRAM coordination, and model swapping. When the reasoning model isn't needed, VLLMSwap reclaims its VRAM for other workloads. When a user asks a hard question (effort 7+), it loads back in. The swap takes 15-30 seconds — which is fine, because reasoning tasks already take 10-60 seconds to think through.
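The coordination pattern can be sketched in a few lines. This is a minimal illustration of treating VRAM as a shared pool with eviction of idle models, not the actual MicroScheduler/VLLMSwap code; the class shape, method names, and numbers are assumptions for illustration:

```python
# Minimal sketch of VRAM-coordinated model swapping. Illustrative only --
# the real MicroScheduler and VLLMSwap APIs may look quite different.
class MicroScheduler:
    def __init__(self, total_vram_gb: float):
        self.total = total_vram_gb
        self.allocations: dict[str, float] = {}  # model -> GB in use
        self.evictable: set[str] = set()         # models safe to swap out

    def free_vram(self) -> float:
        return self.total - sum(self.allocations.values())

    def request(self, model: str, needed_gb: float) -> list[str]:
        """Allocate VRAM for `model`, evicting idle models if necessary.
        Returns the list of models that were swapped out."""
        evicted = []
        for victim in sorted(self.evictable):
            if self.free_vram() >= needed_gb:
                break
            self.allocations.pop(victim, None)   # reclaim victim's VRAM
            self.evictable.discard(victim)
            evicted.append(victim)
        if self.free_vram() < needed_gb:
            raise RuntimeError(f"cannot fit {model}: {needed_gb} GB needed")
        self.allocations[model] = needed_gb
        return evicted
```

A request for a large allocation swaps out whatever is marked idle first, which mirrors the 15-30 second swap-out/reload cycle described above.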

Ollama: The Fallback

Before vLLM, we ran everything through Ollama. It's still there as a fallback layer — if vLLM crashes or we're doing local development without Docker, Ollama picks up the work with GGUF-quantized models:

| Model | Tier | VRAM | Use Case |
|---|---|---|---|
| nemotron-reflex (Q4_K_M) | Reflex | ~6 GB | Fast responses, neurons |
| nemotron-agent (Q5_K_M) | Agent | ~9 GB | Multi-step tool usage |
| nemotron-reasoning (Q5_K_M) | Reasoning | ~12 GB | Complex analysis |
| aither-orchestrator | Orchestrator | ~8 GB | Task routing |

The routing is transparent. Our LLM Gateway checks vLLM first; if it's unavailable, requests fall through to Ollama on port 11434. Callers never know the difference.

Request → LLM Gateway
              |
         vLLM available?
        /           \
      Yes            No
       |              |
    vLLM API    Ollama API

Why both? vLLM is strictly better for production — continuous batching, paged attention, higher throughput. But Ollama is invaluable for development. It starts in seconds, doesn't need Docker, handles GGUF models natively, and is perfect for single-user testing. We kept both because the fallback has saved us more than once when a vLLM container OOMs at 2 AM.
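The health-check-then-fallback routing above can be sketched as follows. This is a simplified illustration, not the gateway's real code; the injectable `probe` parameter is an assumption for testability, though vLLM's OpenAI-compatible server does expose a `/health` route and Ollama listens on port 11434 by default:

```python
import urllib.request

VLLM_URL = "http://localhost:8120"     # orchestrator worker port, per the table above
OLLAMA_URL = "http://localhost:11434"  # Ollama's default port

def pick_backend(probe=urllib.request.urlopen) -> str:
    """Return the base URL of the first healthy backend: vLLM, else Ollama."""
    try:
        probe(f"{VLLM_URL}/health", timeout=2)  # vLLM health endpoint
        return VLLM_URL
    except Exception:
        return OLLAMA_URL                       # fall through transparently
```

Callers build their request against whichever base URL comes back, which is why they never know the difference.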


Tier 1.5: The Multimodal Stack — Vision, Voice, 3D, Image Gen

This is where a single GPU gets interesting. We're not just running text models. The same RTX 5090 hosts:

Vision — Qwen2.5-VL-7B

Users paste screenshots, upload photos, share diagrams. The vision model runs as a dedicated vLLM worker (port 8201) and handles:

  • Screenshot analysis ("what's broken in this UI?")
  • Document OCR and understanding
  • UI element detection for automated testing
  • Reference image analysis for creative workflows

When we built our swarm coding engine, we added a TESTER-VISUAL role that uses vision to verify UI output — it takes a screenshot of a generated component and checks whether the elements match the spec. Vision as a QA tool, not just a chatbot feature.

Voice — faster-whisper + Piper TTS

Speech-to-text runs locally via faster-whisper (large-v3 model, ~3 GB VRAM). Text-to-speech uses Piper, which runs on CPU — no VRAM needed. The voice pipeline coordinates with MicroScheduler for VRAM allocation, and voice "slots" ensure that STT doesn't starve the orchestrator during peak usage.

This isn't a gimmick. Local voice means zero-latency transcription, no audio leaving your network, and the ability to have a voice conversation with your AI operating system while it runs local models. Try that with a cloud API at 200ms round-trip latency.

ComfyUI — Image and Video Generation

ComfyUI handles image generation (FLUX, SDXL), video generation (AnimateDiff), and serves as the backbone for our creative pipeline. It's on-demand — when a user requests image generation, MicroScheduler coordinates VRAM eviction:

  1. Reasoning model gets swapped out (if loaded)
  2. ComfyUI gets allocated ~16 GB for FLUX fp8
  3. Image generates in 10-30 seconds
  4. ComfyUI releases VRAM
  5. Reasoning can reload when needed

We have a dedicated creative GPU profile that flips the VRAM budget: orchestrator gets 35%, and ComfyUI gets a massive 19 GB for uninterrupted creative workflows. Switch profiles with a single PowerShell command.

Hunyuan3D — Text/Image to 3D Mesh

The 3D pipeline runs Hunyuan3D locally (port 8290) for mesh generation from text prompts or reference images. Combined with ComfyUI's image generation, we have a full text-to-image-to-3D pipeline running locally. Our Iris agent (visual artisan identity) orchestrates the whole chain: generate a reference image with FLUX, pass it to Hunyuan3D for mesh generation, output GLB/OBJ with PBR textures.

All of this on one consumer GPU. The trick isn't magic — it's scheduling. MicroScheduler treats VRAM like a shared resource pool, and models that aren't actively serving requests release their allocation.


Tier 2: Cloud GPUs — Vast.ai and Reasoning as a Tool

Local inference works beautifully for one user. When you add multiple concurrent users, the single-GPU model breaks down. Three people requesting orchestrator, reasoning, and image generation simultaneously creates a queue of VRAM swaps that can leave someone waiting 90 seconds for a response.

Our solution: offload the workloads that tolerate latency to rented GPUs.

Vast.ai: Rent by the Second

Vast.ai is a GPU marketplace. An RTX 3090 costs $0.15/hour. An RTX 4090 costs $0.20/hour. We define cloud node profiles that mirror our local model fleet:

# cloud_node_profiles.yaml
profiles:
  orchestrator:
    model: cyankiwi/Nemotron-Orchestrator-8B-AWQ-4bit
    budget_per_hour: 0.15
    always_on: true    # Always running in cloud

  reasoning:
    model: casperhansen/deepseek-r1-distill-qwen-14b-awq
    budget_per_hour: 0.15
    on_demand: true     # Spins up when needed
    wake_on:
      effort_threshold: 7
      queue_depth_threshold: 3
    sleep_after_idle_min: 15

  vision_voice:
    model: Qwen/Qwen2.5-VL-7B-Instruct
    budget_per_hour: 0.12
    on_demand: true
    shared: true        # Vision + whisper on same GPU

  creative:
    template: comfyui
    budget_per_hour: 0.30
    on_demand: true
    sleep_after_idle_min: 10

Budget guardrails are built in: a $5/day cap, a hard stop at $5 remaining credit, and warning events at 80% of budget consumed. The CloudProvisioner handles lifecycle — spinning up nodes when demand rises, sleeping them after idle periods, and terminating everything if we hit the budget floor.
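The guardrail logic reduces to a simple threshold check. A minimal sketch, assuming the $5/day cap, $5 credit floor, and 80% warning level from the config above (the function name and return values are illustrative, not the CloudProvisioner's real interface):

```python
def budget_action(spent_today: float, remaining_credit: float,
                  daily_cap: float = 5.0, floor: float = 5.0,
                  warn_frac: float = 0.80) -> str:
    """Map current spend to an action: 'ok', 'warn', or 'stop'."""
    if remaining_credit <= floor or spent_today >= daily_cap:
        return "stop"   # hard stop: terminate all cloud nodes
    if spent_today >= warn_frac * daily_cap:
        return "warn"   # fire a budget-warning event
    return "ok"
```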

Reasoning as a Tool

The key architectural insight: reasoning doesn't need to run on the same GPU as the orchestrator. The orchestrator handles every message and needs sub-second latency. Reasoning tasks already take 30-60 seconds of thinking time. Adding 100ms of network latency to a 30-second reasoning chain is invisible.

So we turned reasoning into a tool call. The orchestrator decides whether a question needs deep thinking, then invokes a reason() tool that sends the problem to a remote GPU (Vast.ai, or a cloud API). While reasoning runs remotely, the local GPU stays free for other users' chat, vision, and image generation requests.

User: "Analyze the race condition in this lock-free queue"
  |
IntentEngine: effort=8, needs_tools=true
  |
EffortScaler: enable_reasoning_tool=true
  |
Orchestrator (local, always loaded):
  "This needs deep analysis."
  → tool_call: reason(problem="analyze race condition...")
  |
_execute_reason():
  → POST to Vast.ai reasoning node (or DeepSeek API)
  → Returns: chain-of-thought analysis
  |
Meanwhile: other users get instant responses — no VRAM contention
  |
Orchestrator: synthesizes final answer from reasoning output

Multi-user problem solved. The orchestrator never waits, vision never queues behind reasoning, and the reasoning model gets a dedicated GPU with zero contention of its own.
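The dispatch step in that flow can be sketched as a single branch. This is an illustration of the reasoning-as-a-tool idea, with the effort-7 threshold taken from the post; the function signature and callables are assumptions, not the real `_execute_reason()` implementation:

```python
def route_request(effort: int, reason_remote, answer_local):
    """Effort 7+ goes through the remote reason() tool; everything else
    is answered directly by the always-loaded local orchestrator."""
    if effort >= 7:
        chain = reason_remote()           # blocks on the remote GPU, not ours
        return answer_local(context=chain)
    return answer_local(context=None)
```

While `reason_remote()` is in flight, the local GPU serves other users; only the final synthesis touches the orchestrator.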


Tier 3: Frontier Models — Judges, Training Supervisors, Fallback Chains

Here's where the hybrid model gets genuinely interesting. We don't just use local and cloud GPUs. We integrate frontier models (Claude, GPT-4o, Gemini, DeepSeek API) into specific roles where their capabilities justify the per-token cost.

Frontier Models as Judges

Our swarm coding engine runs 12 agents in a 4-phase pipeline: ARCHITECT, SWARM (8 parallel agents), REVIEW, JUDGE. The judge role is the final arbiter — it decides whether generated code gets accepted, revised, or rejected.

For the judge, accuracy matters more than cost. A bad judgment wastes an entire swarm cycle. So the judge can route to frontier models:

# Production fallback chain for reasoning
fallback_chain:
  - provider: deepseek
    model: deepseek-reasoner
  - provider: google
    model: gemini-3-pro-preview
  - provider: anthropic
    model: claude-opus-4
  - provider: openai
    model: o3

The local DeepSeek-R1:14b handles most judge duties. But when the confidence score is low, or the code is security-critical, the system escalates to a frontier model. Claude Opus reviews the architecture. GPT-4o validates the test coverage. The cost per escalation is pennies. The cost of shipping broken code is much higher.
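The escalation decision is essentially a confidence gate. A hedged sketch of that logic, where the 0.8 confidence floor and the judge callables are illustrative assumptions rather than the swarm engine's actual code:

```python
def judge_verdict(code: str, local_judge, frontier_judge,
                  confidence_floor: float = 0.8,
                  security_critical: bool = False):
    """Run the local judge first; escalate to a frontier model when
    confidence is low or the change is security-critical."""
    verdict, confidence = local_judge(code)
    if security_critical or confidence < confidence_floor:
        return frontier_judge(code)   # pennies per call, used sparingly
    return verdict
```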

Frontier Models for Training Supervision

Our closed-loop training pipeline generates training data from real conversations, filters it through quality checks, and fine-tunes the local orchestrator model. The quality check step is where frontier models earn their keep:

  • DPO pair generation: Claude rates response pairs for preference training
  • Safety filtering: Frontier models flag training examples that could introduce harmful behaviors
  • Benchmark validation: After fine-tuning, we run benchmarks against frontier model outputs to verify we haven't regressed

The training loop runs on consumer hardware (QLoRA fine-tuning takes ~28 minutes on the RTX 5090), but the quality supervision uses the best models available. It's like having a senior engineer review the curriculum that trains your junior developers.

The Production Fallback Chain

Our production GPU profile routes specialist models to cloud APIs with cascading fallbacks:

| Role | Primary | Fallback 1 | Fallback 2 | Fallback 3 |
|---|---|---|---|---|
| Reasoning | DeepSeek Reasoner | Gemini 3 Pro | Claude Opus 4 | o3 |
| Coding | DeepSeek Coder | Claude 3.5 Sonnet | Gemini 3 Flash | Qwen3-Coder |
| Vision | Gemini 3 Flash | Gemini 2.5 Pro | GPT-4o | Claude Sonnet 4 |
| Image Gen | FLUX 1.1 Pro (Replicate) | SDXL | FLUX Schnell | — |
| 3D Mesh | Hunyuan3D-2 (Replicate) | StablePoint3D | InstantMesh | — |
| Video | Minimax Video-01 | Luma Ray | Hunyuan Video | — |

If DeepSeek's API goes down, reasoning silently falls through to Gemini. If Gemini is slow, it tries Claude. The orchestrator doesn't know or care which backend answered — it gets a response in the same format regardless.
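The cascade itself is a short loop over the chain. A minimal sketch, with the provider/model pairs taken from the YAML above; `call_provider` is a stand-in for whatever client actually issues the request:

```python
FALLBACK_CHAIN = [
    ("deepseek", "deepseek-reasoner"),
    ("google", "gemini-3-pro-preview"),
    ("anthropic", "claude-opus-4"),
    ("openai", "o3"),
]

def call_with_fallback(prompt: str, call_provider, chain=FALLBACK_CHAIN):
    """Try each provider in order; the caller sees one uniform response
    no matter which backend ultimately answered."""
    last_err = None
    for provider, model in chain:
        try:
            return call_provider(provider, model, prompt)
        except Exception as err:       # provider down, rate-limited, or slow
            last_err = err
    raise RuntimeError("all providers failed") from last_err
```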


The Nemotron-Elastic Experiment

One of our more interesting experiments has been with NVIDIA's Nemotron-Elastic models. The key innovation: three model sizes (6B/9B/12B) from a single checkpoint through zero-shot activation slicing. You download one set of weights and get three models.

The hybrid Mamba2/Transformer architecture means:

| Variant | Reasoning Score | VRAM (BF16) | Our Use Case |
|---|---|---|---|
| nemotron-elastic:6b | 70.61 | ~6 GB | Reflex tier — neurons, background tasks |
| nemotron-elastic:9b | 75.95 | ~9 GB | Agent tier — multi-step tool usage |
| nemotron-elastic:12b | 77.41 | ~12 GB | Reasoning tier — complex analysis |

The practical benefit is hot-swapping between model sizes without reloading weights. Our ElasticModelStrategy keeps the 6B variant always loaded alongside the orchestrator (total ~11 GB), then dynamically slices up to 9B or 12B when a harder task arrives. The slice operation is near-instant compared to loading a completely different model from disk.

We paired this with our existing orchestrator-8B (specialized for routing and tool calling) and got a surprisingly capable four-tier stack:

  1. Effort 1-3: nemotron-elastic:6b (reflex — instant responses)
  2. Effort 4-6: Nemotron-Orchestrator-8B (standard chat, tool calling)
  3. Effort 7-8: nemotron-elastic:12b (deep reasoning, sliced from same checkpoint)
  4. Effort 9-10: DeepSeek-R1:14b via cloud (maximum reasoning quality)
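The four-tier routing above maps cleanly to a small selection function. The tier boundaries and model names come straight from the list; the function itself is an illustrative sketch, not the ElasticModelStrategy's real interface:

```python
def select_model(effort: int) -> str:
    """Map a 1-10 effort score to the four-tier stack described above."""
    if effort <= 3:
        return "nemotron-elastic:6b"        # reflex: instant responses
    if effort <= 6:
        return "Nemotron-Orchestrator-8B"   # standard chat, tool calling
    if effort <= 8:
        return "nemotron-elastic:12b"       # sliced up from the same checkpoint
    return "deepseek-r1:14b (cloud)"        # maximum reasoning quality
```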

The Nemotron-Elastic models won't replace fine-tuned specialist models for every task. But for a development deployment where you want maximum capability diversity with minimum VRAM, running three model sizes from one checkpoint is genuinely useful.


Coding Swarms: GitHub Actions + Local Runners

The hybrid deployment extends beyond inference. Our coding pipeline — the Dark Factory — is itself a hybrid system that runs across cloud and local compute.

Phase 1: Cloud Swarm (GitHub Actions)

Our dark-factory-swarm.yml workflow fans out across GitHub Actions runners. Each work unit gets its own runner — 4-core, 16 GB RAM, free compute on public repos. Up to 8 parallel jobs generating code simultaneously:

  1. Plan job: Atlas (architect agent) generates a work breakdown via GitHub Models API
  2. Factory jobs: Matrix fan-out — each runner calls GPT-4o/o3-mini to generate one component
  3. Collect job: Merges all artifacts to a feature branch
  4. Notify job: Creates tracking issue + draft PR, fires repository_dispatch for local refinement

13 specialized roles: 3 coders (frontend, backend, integration), 3 testers (unit, QA, visual), 2 security auditors (threat modeling, injection testing), 1 architect, 1 scribe, 1 reviewer, 1 judge. All running in parallel on free cloud compute. Total cost: $0.

Phase 2: Local Refinement (Self-Hosted Runner)

The cloud swarm produces rough code. Syntactically correct, architecturally naive. The real intelligence lives in Phase 2, which runs on a self-hosted GitHub Actions runner — your own machine, with your own GPU.

The local-agent-worker.yml workflow runs on runs-on: self-hosted with direct access to:

  • Genesis (port 8001) — the full agent orchestrator
  • CodeGraph — semantic code search and AST analysis
  • MemoryGraph — historical decisions and past code patterns
  • Local LLMs via MicroScheduler — no API costs
  • All 16 agent identities with full tool profiles
  • SwarmCodingEngine for complex multi-agent tasks

The refinement pipeline:

  1. Hydra (code review agent) — multi-angle review: correctness, style, architecture, test coverage
  2. Athena (security agent) — STRIDE threat modeling with real tool access, injection vector mapping
  3. Demiurge (code agent) — targeted fixes using CodeGraph context and MemoryGraph history
  4. Genesis (judge) — final ACCEPT / REVISE / REJECT verdict

Each agent runs through AgentForge — a ReAct loop with 100+ tools, VRAM-coordinated GPU access, capability-token authorization, and effort-scaled model selection. This is the full AitherOS intelligence stack, not a wrapper around an API call.

The Feedback Loop

The self-hosted runner also handles a daily scheduled run (5 AM UTC):

  • Scans open GitHub issues, picks the top 3 most impactful bugs
  • Hunts for hardcoded ports, bare excepts, TODO/FIXME items in the codebase
  • Fixes the top findings
  • Runs tests to verify
  • Creates separate commits with descriptive messages
  • Opens PRs automatically

Issues labeled agent:local or agent:demiurge get automatically routed to the local runner. Atlas reviews PRs and posts feedback. Demiurge picks up that feedback and pushes revisions. The loop runs without human intervention until a human explicitly approves or rejects.

Forge Dispatch: Bridging Cloud and Local

The forge-dispatch.yml workflow is the router between cloud and local execution:

  • copilot_task: Creates an issue and assigns to GitHub Copilot (cloud agent)
  • agentic_trigger: Dispatches to agentic workflows (continuous docs, testing, simplification)
  • local_agent: Routes to the self-hosted runner for full GPU-backed agent dispatch
  • notify: Posts status updates on issues and PRs

AitherForge (our agent dispatch system) decides at runtime whether a task needs cloud compute (cheap, parallel, no GPU needed) or local compute (GPU access, full codebase context, private data). Simple code generation goes to the cloud swarm. Security audits, performance optimization, and architecture refactoring route to the local runner.
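That runtime decision reduces to checking what the task actually needs. A rough sketch of the split described above; the field names are illustrative, not AitherForge's real task schema:

```python
def dispatch_target(task: dict) -> str:
    """Decide where a task runs: GPU access, full codebase context, or
    private data forces the self-hosted runner; everything else goes to
    the free cloud swarm."""
    local_only = ("needs_gpu", "needs_codebase", "private_data")
    if any(task.get(flag) for flag in local_only):
        return "self-hosted"
    return "github-actions"
```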


GPU Profiles: Switching Modes in One Command

The entire deployment adapts to what you're doing. We define named GPU profiles that reshape the VRAM allocation:

dev — All Models Local (Default)

Everything runs on one GPU. Orchestrator 35%, reasoning 40%, coding 18%, vision 12%, embeddings 5%. Maximum capability, moderate speed. Good for solo development.

production — Orchestrator Maximized

Orchestrator gets 92% of VRAM — 32K context window, 64 concurrent sequences, CUDA graphs fully populated. Everything else routes to cloud APIs (DeepSeek, Gemini, Replicate). Local orchestrator latency drops from ~800ms to ~200ms first-token. This is the multi-user profile.

creative — Image/3D Pipeline

Orchestrator gets 35%, and 19 GB stays free for ComfyUI, Hunyuan3D, and AnimateDiff. Optimized for creative workflows: text-to-image, image-to-3D, video generation. All local, all GPU-accelerated.

Switch between profiles:

# Switch to production mode
./AitherZero/library/automation-scripts/08-aitheros/0850_Switch-GpuProfile.ps1 -Profile production

# Or via environment variable
$env:AITHER_GPU_PROFILE = "creative"
docker compose -f docker-compose.aitheros.yml up -d

The profile system means we don't compromise. Solo development gets all models local. Multi-user production gets maximum orchestrator speed with cloud specialist models. Creative sessions get maximum ComfyUI VRAM. Same hardware, three different operating modes.


What This Actually Costs

Monthly operating cost for the full hybrid deployment:

| Component | Cost |
|---|---|
| Local GPU (electricity) | ~$15-20 |
| Vast.ai (reasoning on-demand) | ~$5-30 |
| Cloud APIs (frontier fallbacks) | ~$5-20 |
| GitHub Actions (cloud swarm) | $0 (free tier) |
| Self-hosted runner | $0 (your own machine) |
| Total | ~$25-70/month |

Compare: a single Devin seat costs $500/month. A Claude Max subscription is $200/month. An AWS p5 instance is $23/hour.

We're running 5 local models, bursting to cloud GPUs on demand, using frontier models as judges and training supervisors, running 13-agent coding swarms on free CI compute, and doing local refinement with full GPU acceleration — for less than a nice dinner.


The Principle

The hybrid deployment isn't about being cheap (though it is). It's about putting each workload where it runs best:

  • Latency-critical, high-frequency (chat orchestration): Local GPU, always loaded, zero network hops
  • Latency-tolerant, compute-heavy (reasoning, creative): Cloud GPU, pay-per-use, dedicated VRAM
  • Accuracy-critical, low-frequency (judging, safety, training QA): Frontier models, pay-per-token, maximum capability
  • Embarrassingly parallel, stateless (code generation swarms): Free CI runners, matrix fan-out
  • Context-rich, stateful (code review, security audit, refactoring): Local self-hosted runner, full system access

No single tier is sufficient. Local-only hits VRAM walls. Cloud-only costs too much and leaks data. API-only means you own nothing and control nothing. The hybrid approach gets you the best of all three — and the switching between tiers is invisible to the user asking the question.

The AI deployment question isn't "local or cloud?" It's "which workload goes where?" Once you answer that per-workload instead of per-system, the architecture designs itself.
