Run a Private AI on Your RTX 5090 — The Complete Guide
You spent $2,000 on an RTX 5090. It's got 32GB of VRAM — enough to run multiple models simultaneously, with intelligent VRAM management that swaps models in and out based on what you're doing. Most guides tell you to install Ollama and chat with it. That's leaving 90% of the card's potential on the table.
Here's how to turn that GPU into a complete private AI platform.
What 32GB VRAM Actually Gets You
The real power isn't running one big model — it's running a stack of specialized models that share VRAM intelligently:
| Model Role | Example | VRAM | When It's Loaded |
|---|---|---|---|
| Orchestrator (8B AWQ) | Nemotron-Orchestrator-8B | ~8GB | Always — pinned, never evicted |
| Reasoning (14B AWQ) | DeepSeek-R1:14B | ~8GB | On demand — loads when you need deep thinking |
| Embeddings | BGE-large | ~1GB | Always — powers search and RAG |
| Vision (7B) | Qwen2.5-VL | ~6GB | On demand — loads when you send images |
| Image Generation | Flux via ComfyUI | ~16GB | On demand — preempts reasoning model |
The orchestrator handles everyday chat, coding, and tool use. When you ask something that needs real reasoning (effort 7+), the reasoning model loads automatically. When you drag in an image, the vision model spins up. When you want image generation, ComfyUI takes over and reasoning routes to cloud fallback until it's done.
You don't manage any of this. The MicroScheduler handles VRAM placement, eviction priority, and automatic reload.
The Aitherium Toolchain
There are three ways to use your GPU with Aitherium, and they layer on top of each other:
1. AitherShell — Talk to Your GPU
The fastest path. An interactive terminal that connects to your local models: what Claude Code is to Claude, AitherShell is to AitherOS.
# Install (standalone binary)
npm install -g @aitherium/aithershell
# Use it
aither # Interactive REPL
aither "explain this error log" # Single query
aither --effort 8 "deep security audit" # Force reasoning model
aither --will lyra "write a poem" # Switch persona
echo "data" | aither --print # Pipe mode for scripts
aither run deploy-check.aither # Run script files
AitherShell auto-detects vLLM/Ollama on your machine and routes to the right model based on effort level. It supports non-blocking input (type while it's generating to steer the response), script files with variables and control flow, and a plugin system — drop a .yaml or .py file in ~/.aither/plugins/ and you have a new command.
Also available as a web terminal on port 3001 when running the full platform, or as a Python fallback (pip install aithershell → aither-py).
2. Aither ADK — Build Agent Fleets
The developer SDK. Build multi-agent systems in Python with 3 lines of code.
pip install aither-adk
adk quickstart # Auto-detect GPU, pull models, configure backends
Single agent:
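import asyncio
from adk import AitherAgent

async def main():
    agent = AitherAgent("aither")  # Auto-detects vLLM/Ollama
    print(await agent.chat("Review this PR for security issues"))

asyncio.run(main())  # chat() is awaited, so it has to run inside an event loop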
Fleet mode — multiple agents collaborating:
import asyncio
from adk.fleet import load_fleet

async def main():
    fleet = load_fleet(agent_names=["aither", "lyra", "demiurge", "hydra"])
    orchestrator = fleet.get_orchestrator()
    # The orchestrator delegates to specialists in the fleet (e.g. hydra for code review)
    print(await orchestrator.chat("Audit the auth module"))

asyncio.run(main())
Serve as an OpenAI-compatible API:
adk-serve --agents aither,lyra,demiurge,hydra --port 8080
# Drop-in replacement for any OpenAI client
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"aither","messages":[{"role":"user","content":"hello"}]}'
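The same call can be made from Python with only the standard library. This is a minimal sketch, assuming adk-serve is listening on port 8080 as in the command above and that the response follows the usual OpenAI shape (`choices[0].message.content`). The helper names `build_chat_request` and `chat` are illustrative and not part of the ADK.

```python
import json
import urllib.request


def build_chat_request(prompt, model="aither", base_url="http://localhost:8080"):
    """Build a POST request for the OpenAI-style /v1/chat/completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as resp:
        body = json.load(resp)
    # Assumption: standard OpenAI response layout.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("hello"))` should print the agent's reply. Any existing OpenAI client library pointed at the same base URL is an alternative to this hand-rolled request.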
Runtime backend switching — swap LLM providers mid-session, no restart:
adk backend set anthropic # Switch primary to Claude
adk backend set-reasoning deepseek # Route effort 7+ to DeepSeek
adk backend list # Show all detected backends
No GPU? No problem. Set an API key and your agents use cloud inference. Have a GPU? They auto-detect it. Both? They route intelligently — local for fast tasks, cloud for deep reasoning.
adk quickstart --cloud # Cloud-only mode (Anthropic, OpenAI, or DeepSeek)
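The routing rule described above (cloud only, local only, or a mix) can be sketched in a few lines. This is an illustration of the decision, not the ADK's actual code: the function name `choose_backend` is hypothetical, and the cut-off of 7 is borrowed from the "effort 7+" wording used elsewhere in this guide.

```python
REASONING_EFFORT = 7  # assumption: "deep reasoning" means effort 7 and above


def choose_backend(has_gpu: bool, has_api_key: bool, effort: int) -> str:
    """Return "local" or "cloud" for a request with the given effort score."""
    if has_gpu and has_api_key:
        # Both available: local for fast tasks, cloud for deep reasoning.
        return "cloud" if effort >= REASONING_EFFORT else "local"
    if has_gpu:
        return "local"
    if has_api_key:
        return "cloud"
    raise RuntimeError("no GPU detected and no API key configured")
```

For example, `choose_backend(True, True, 3)` returns `"local"`, while the same machine with `effort=8` returns `"cloud"`.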
3. AitherNode — Give Agents a Body
AitherNode is an MCP server that exposes your local hardware to agents. It turns your machine into a tool server:
- Local image generation — ComfyUI integration (Flux, SDXL, Pony)
- Local LLM inference — Ollama / vLLM integration
- File system access — safe local file operations
- Hardware stats — GPU/CPU monitoring
AitherNode runs as a daemon. Agents — whether local or cloud-based — connect to it via MCP to use your hardware. Think of it as the bridge between AI brains and your physical machine.
# AitherNode starts automatically with the full platform,
# or run standalone:
docker compose up aither-node
Any MCP-compatible client (Claude Desktop, Cursor, etc.) can connect to AitherNode and get access to 100+ tools — code search, file operations, image generation, memory, and more.
Full Platform (Docker)
For the complete experience — web UI, app builder, 43 agents, deployment — run the full stack:
git clone https://github.com/Aitherium/aither
cd aither
docker compose up -d
Open http://localhost:3000 and you get:
- App Builder — build AI agent apps with 40+ panels, no code required
- Private Chat — with voice, document analysis, knowledge base
- Agent Fleet — 43 specialist agents that collaborate on tasks
- Deployment — host apps for your team or clients via Cloudflare tunnels
The ADK and AitherShell work standalone or as part of the full platform. When Genesis (the orchestrator) is running, they gain access to the full context pipeline — memory, knowledge graph, CodeGraph indexing, multi-agent coordination.
How the Model Stack Works
AitherOS doesn't just "run a model." It runs a model stack — a coordinated set of models with priority-based VRAM management. Here's the default local-reasoning stack for an RTX 5090:
VRAM Budget (32GB):
[P0] Nemotron-Orchestrator 8B AWQ   8GB    PINNED — never evicted
[P1] Vision 7B                      6GB    lazy — loads on image input
[P2] ComfyUI (Flux)                 16GB   lazy — preempts reasoning
[P3] DeepSeek-R1:14B AWQ            8GB    lazy — loads on effort >= 7
     Embeddings                     ~1GB   always loaded
Swap rules:
- The orchestrator is pinned at priority 0 — it never gets evicted
- DeepSeek-R1 loads on first effort >= 7 request, stays loaded (only 8GB)
- ComfyUI preempts the reasoning model when you request image generation
- While ComfyUI is active, reasoning requests route to cloud fallback
- ComfyUI unloads after 5 minutes idle, reasoning model reloads automatically
- Vision coexists with orchestrator + reasoning (22GB total, fits fine)
All of this is managed by MicroScheduler (port 8150), which handles VRAM coordination, priority queuing, and preemption. Every LLM call in the system routes through it — you never call vLLM directly.
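To make the swap rules concrete, here is a toy model of priority-based eviction. It is a sketch of the idea only, not MicroScheduler's implementation: the `VramPlanner` and `Model` classes are hypothetical, sizes are the approximate figures from the budget above, and treating embeddings as a pinned priority-0 model is an assumption.

```python
from dataclasses import dataclass


class VramContention(RuntimeError):
    """Raised when a model cannot fit even after every allowed eviction."""


@dataclass(frozen=True)
class Model:
    name: str
    vram_gb: int
    priority: int          # lower number = more important
    pinned: bool = False   # pinned models are never evicted


class VramPlanner:
    def __init__(self, budget_gb: int = 32):
        self.budget_gb = budget_gb
        self.loaded = {}

    def used_gb(self) -> int:
        return sum(m.vram_gb for m in self.loaded.values())

    def load(self, model: Model) -> list:
        """Load `model`, evicting less important unpinned models if needed.

        Returns the names of the models that were evicted.
        """
        if model.name in self.loaded:
            return []
        # Only unpinned models with a strictly lower priority may be evicted,
        # least important first.
        victims = sorted(
            (m for m in self.loaded.values()
             if not m.pinned and m.priority > model.priority),
            key=lambda m: m.priority,
            reverse=True,
        )
        needed = self.used_gb() + model.vram_gb - self.budget_gb
        if needed > sum(v.vram_gb for v in victims):
            raise VramContention(f"cannot fit {model.name}")
        evicted = []
        for victim in victims:
            if needed <= 0:
                break
            del self.loaded[victim.name]
            evicted.append(victim.name)
            needed -= victim.vram_gb
        self.loaded[model.name] = model
        return evicted

    def unload(self, name: str) -> None:
        self.loaded.pop(name, None)


planner = VramPlanner(32)
orchestrator = Model("Nemotron-Orchestrator-8B", 8, priority=0, pinned=True)
embeddings = Model("BGE-large", 1, priority=0, pinned=True)
vision = Model("Qwen2.5-VL", 6, priority=1)
comfyui = Model("ComfyUI (Flux)", 16, priority=2)
reasoning = Model("DeepSeek-R1:14B", 8, priority=3)

for m in (orchestrator, embeddings, vision, reasoning):
    planner.load(m)               # 23GB resident, nothing evicted

evicted = planner.load(comfyui)   # needs 16GB, so the P3 reasoning model goes
print(evicted, planner.used_gb())
```

While ComfyUI is resident, calling `planner.load(reasoning)` raises `VramContention`, which is the point at which the real system would route the request to cloud fallback. After `planner.unload(comfyui.name)` the reasoning model fits again.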
Effort-Based Routing
Requests are scored by effort (1-10), which determines which model handles them:
| Effort | Tier | Model | Use Case |
|---|---|---|---|
| 1-2 | Fast | Nemotron-Orchestrator 8B | Quick answers, simple tasks |
| 3-6 | Balanced | Nemotron-Orchestrator 8B | Chat, coding, tool use |
| 7+ | Reasoning | DeepSeek-R1:14B | Complex analysis, multi-step reasoning |
The effort level is assigned automatically based on the complexity of your request. You can also force it — aither --effort 8 "deep analysis" or AitherAgent("aither", effort=8).
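The routing table boils down to a small lookup. The sketch below mirrors the tiers and model names listed above. The function name `route_by_effort` is illustrative, and how the effort score itself is computed is not covered here.

```python
def route_by_effort(effort: int) -> tuple:
    """Map an effort score (1-10) to a (tier, model) pair."""
    if not 1 <= effort <= 10:
        raise ValueError("effort must be between 1 and 10")
    if effort <= 2:
        return ("Fast", "Nemotron-Orchestrator 8B")
    if effort <= 6:
        return ("Balanced", "Nemotron-Orchestrator 8B")
    return ("Reasoning", "DeepSeek-R1:14B")
```

Note that the Fast and Balanced tiers share a model, so on this stack the only boundary that triggers a model load is the jump from 6 to 7.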
GPU Tiers — Not Just for 5090s
The ADK auto-detects your hardware and picks the right model configuration:
adk setup --tier nano # 6GB VRAM — TQ4 quantized, single model
adk setup --tier standard-tq4 # 12-16GB — both models TQ4
adk setup --tier full # 24GB+ — orchestrator + reasoning + embeddings
adk quickstart # Auto-detect — picks the best tier for your GPU
The 5090 gets the full tier by default, but the entire toolchain works on cards as small as a 3060 (12GB) — just with smaller models and no local reasoning.
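Tier auto-detection amounts to a threshold check on available VRAM. Here is a rough sketch, assuming each tier applies from its stated minimum up to the next tier's minimum. The guide does not say what happens for in-between sizes such as 8GB or 20GB, so that rule, like the `pick_tier` name, is a guess.

```python
from typing import Optional

# (tier name, minimum VRAM in GB), largest first. Minimums come from the
# `adk setup --tier` comments above.
TIER_MIN_VRAM_GB = [("full", 24), ("standard-tq4", 12), ("nano", 6)]


def pick_tier(vram_gb: float) -> Optional[str]:
    """Return the largest tier whose minimum VRAM is met, or None."""
    for tier, min_gb in TIER_MIN_VRAM_GB:
        if vram_gb >= min_gb:
            return tier
    return None  # too little VRAM for any local tier


print(pick_tier(32))  # RTX 5090
print(pick_tier(12))  # RTX 3060
```

A `None` result would correspond to falling back to cloud-only mode.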
Stack Profiles
The system ships with multiple stack profiles you can switch at runtime:
# See available stacks
curl localhost:8001/model-stacks
# Switch stacks (hot — no restart needed)
curl -X POST localhost:8001/model-stacks/switch -H "Content-Type: application/json" -d '{"stack": "local-gemma4"}'
| Stack | Models | VRAM | Best For |
|---|---|---|---|
| local-reasoning | Nemotron-Orchestrator + DeepSeek-R1 + Vision | ~17GB | General use (default) |
| local-gemma4 | Nemotron-Orchestrator + Gemma 4 E4B | ~14GB | Multimodal reasoning (images + CoT in same context) |
The local-gemma4 stack is interesting: Gemma 4 has native vision, so it replaces both the reasoning and vision models with one 5GB model. By the per-model figures above (8GB + 6GB replaced by 5GB), that frees roughly 9GB of VRAM for longer conversations or ComfyUI.
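A quick way to sanity-check a stack is to add up its per-model footprints against the 32GB budget. The sketch below uses the approximate sizes quoted in this guide. It assumes the table's VRAM column counts the models that stay resident (orchestrator, reasoning or Gemma, embeddings) and leaves out lazily loaded ones. That reading matches the ~17GB and ~14GB figures but is not stated explicitly.

```python
TOTAL_VRAM_GB = 32

# Approximate per-model footprints in GB, as quoted in this guide.
MODEL_VRAM_GB = {
    "orchestrator": 8,
    "reasoning": 8,
    "vision": 6,
    "embeddings": 1,
    "gemma4": 5,
    "comfyui": 16,
}

# Assumption: these are the resident models for each stack.
STACKS = {
    "local-reasoning": ["orchestrator", "reasoning", "embeddings"],
    "local-gemma4": ["orchestrator", "gemma4", "embeddings"],
}


def footprint(stack, extra=()):
    """Total VRAM in GB for a stack plus any extra on-demand models."""
    return sum(MODEL_VRAM_GB[m] for m in list(STACKS[stack]) + list(extra))


def fits(stack, extra=()):
    return footprint(stack, extra) <= TOTAL_VRAM_GB


for name in STACKS:
    print(name, footprint(name), "GB, ComfyUI fits alongside:", fits(name, ["comfyui"]))
```

Under these numbers, ComfyUI does not fit next to the local-reasoning stack (33GB), which is why it has to preempt the reasoning model, while it does fit next to local-gemma4 (30GB).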
Why Not Just Use LM Studio?
LM Studio is excellent for one thing: downloading and chatting with models. If that's all you need, use it.
But LM Studio gives you a chat window. Aitherium gives you a toolchain:
| Feature | LM Studio | Aitherium |
|---|---|---|
| Chat with models | Yes | Yes (AitherShell, web UI, or API) |
| Build agent systems | No | Yes (ADK — Python SDK, fleet mode) |
| MCP tool server | No | Yes (AitherNode — 100+ tools) |
| VRAM-managed multi-model | No (one model) | Yes (priority-based eviction) |
| Effort-based model routing | No | Yes (auto-selects model by complexity) |
| Runtime backend switching | No | Yes (swap providers mid-session) |
| Voice input/output | No | Yes (local TTS/STT) |
| Document analysis / RAG | No | Yes (drag-and-drop, embeddings) |
| Build apps | No | Yes (40+ panels, visual builder) |
| Deploy for others | No | Yes (Docker + Cloudflare tunnels) |
| Multi-agent orchestration | No | Yes (43 agents, swarm coding) |
| Image generation | No | Yes (ComfyUI, auto VRAM swap) |
| CLI shell | No | Yes (AitherShell — scriptable, pipeable) |
| Plugin system | No | Yes (YAML/Python plugins) |
| Works without GPU | No | Yes (cloud backends, API keys) |
We wrote a detailed comparison with Pinokio if you want the full breakdown.
Scaling Beyond One GPU
Have a second machine? AitherOS supports offloading reasoning to a separate GPU over your local network. For example, an NVIDIA DGX Spark with 128GB unified memory can run a 27B reasoning model while your 5090 handles the orchestrator and image generation.
The system handles this transparently — reasoning requests route to the remote model via proxy, and everything looks like one unified stack to the user.
Privacy Guarantees
When running locally:
- Zero data leaves your machine (telemetry is off unless you opt in)
- No internet required after initial model download
- No API keys, no accounts, no cloud dependencies
- Works in air-gapped environments
- Cloud fallback is opt-in and only activates during VRAM contention
This matters for: lawyers, doctors, financial advisors, journalists, anyone handling sensitive data.
Making Money With Your GPU
A private AI platform isn't just for personal use:
- Selling AI services — build agent apps with the ADK, deploy with adk-serve, charge per seat
- Consulting — set up private AI for businesses that can't use cloud (legal, medical, finance)
- Content creation — image generation, writing, voice — all with zero API costs
- Development — code faster with agents that know your entire codebase (CodeGraph indexes everything)
- API provider — run adk-serve and sell OpenAI-compatible API access on your own hardware
The ROI math: 2 hours/day is roughly 60 hours a month, so if that time is worth about $35/hour or more (saved or billed), the $2,000 card pays for itself in under a month.
Getting Started Today
Fastest path (2 minutes):
npm install -g @aitherium/aithershell
aither "hello"
Developer path (5 minutes):
pip install aither-adk
adk quickstart
adk start
Full platform (10 minutes):
git clone https://github.com/Aitherium/aither
cd aither
docker compose up -d
# Open localhost:3000
Your RTX 5090 is a private AI platform. Stop paying $20/month for ChatGPT when you have better hardware sitting under your desk.
Questions? Join the community or email hello@aitherium.com.