Tags: hardware, gpu, setup

Run a Private AI on Your RTX 5090 — The Complete Guide

Name: AitherOS
Author: Aitherium

May 25, 2026 | 5 min read | Aitherium

You spent $2,000 on an RTX 5090. It has 32GB of VRAM, enough to keep several models resident at once and swap others in and out depending on what you're doing. Most guides tell you to install Ollama and chat with it, which leaves most of the card's potential unused.

Here's how to turn that GPU into a complete private AI platform.

What 32GB VRAM Actually Gets You

The real power isn't running one big model — it's running a stack of specialized models that share VRAM intelligently:

Model Role	Example	VRAM	When It's Loaded
Orchestrator (8B AWQ)	Nemotron-Orchestrator-8B	~8GB	Always — pinned, never evicted
Reasoning (14B AWQ)	DeepSeek-R1:14B	~8GB	On demand — loads when you need deep thinking
Embeddings	BGE-large	~1GB	Always — powers search and RAG
Vision (7B)	Qwen2.5-VL	~6GB	On demand — loads when you send images
Image Generation	Flux via ComfyUI	~16GB	On demand — preempts reasoning model

The orchestrator handles everyday chat, coding, and tool use. When you ask something that needs real reasoning (effort 7+), the reasoning model loads automatically. When you drag in an image, the vision model spins up. When you want image generation, ComfyUI takes over the reasoning model's VRAM, and reasoning requests route to the cloud fallback (if you have enabled it) until image generation finishes.

You don't manage any of this. The MicroScheduler handles VRAM placement, eviction priority, and automatic reload.
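As a rough illustration of the triggers described above, the sketch below maps a request to the set of models it would need. The function name, parameter names, and short model labels are hypothetical; only the trigger rules (effort 7+, image input, image generation) come from this post.

```python
# Models the post describes as always resident.
ALWAYS_LOADED = {"orchestrator", "embeddings"}


def models_for_request(effort: int = 5,
                       has_image: bool = False,
                       wants_image_generation: bool = False) -> set[str]:
    """Return the models a request needs, following the triggers in the table."""
    needed = set(ALWAYS_LOADED)
    if effort >= 7:
        needed.add("reasoning")   # deep thinking, loaded on demand
    if has_image:
        needed.add("vision")      # loaded when an image is attached
    if wants_image_generation:
        needed.add("comfyui")     # image generation via ComfyUI
    return needed


if __name__ == "__main__":
    print(sorted(models_for_request(effort=8, has_image=True)))
```

This only answers "which models does the request want"; whether they all fit in VRAM at once is a separate placement decision.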

The Aitherium Toolchain

There are three ways to use your GPU with Aitherium, and they layer on top of each other:

1. AitherShell — Talk to Your GPU

The fastest path. An interactive terminal that connects to your local models — like Claude Code is to Claude, AitherShell is to AitherOS.

# Install (standalone binary)
npm install -g @aitherium/aithershell

# Use it
aither                                       # Interactive REPL
aither "explain this error log"              # Single query
aither --effort 8 "deep security audit"      # Force reasoning model
aither --will lyra "write a poem"            # Switch persona
echo "data" | aither --print                 # Pipe mode for scripts
aither run deploy-check.aither               # Run script files

AitherShell auto-detects vLLM/Ollama on your machine and routes to the right model based on effort level. It supports non-blocking input (type while it's generating to steer the response), script files with variables and control flow, and a plugin system — drop a .yaml or .py file in ~/.aither/plugins/ and you have a new command.

Also available as a web terminal on port 3001 when running the full platform, or as a Python fallback (pip install aithershell → aither-py).

2. Aither ADK — Build Agent Fleets

The developer SDK. Build multi-agent systems in Python with 3 lines of code.

pip install aither-adk
adk quickstart          # Auto-detect GPU, pull models, configure backends

Single agent:

from adk import AitherAgent

agent = AitherAgent("aither")  # Auto-detects vLLM/Ollama
# `await` needs an async context (an `async def` function or an asyncio REPL)
response = await agent.chat("Review this PR for security issues")

Fleet mode — multiple agents collaborating:

from adk.fleet import load_fleet

fleet = load_fleet(agent_names=["aither", "lyra", "demiurge", "hydra"])
orchestrator = fleet.get_orchestrator()

# Aither delegates to hydra for code review; to delegate security work to athena, add it to agent_names
response = await orchestrator.chat("Audit the auth module")

Serve as an OpenAI-compatible API:

adk-serve --agents aither,lyra,demiurge,hydra --port 8080

# Drop-in replacement for any OpenAI client
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"aither","messages":[{"role":"user","content":"hello"}]}'
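A rough Python equivalent of the curl call, using only the standard library. The response parsing assumes the standard OpenAI chat-completions shape (`choices[0].message.content`), which the post implies but does not show; `build_chat_request` and `chat` are illustrative names, not part of the ADK.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       model: str = "aither",
                       base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build the same POST request as the curl example, without sending it."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the request and return the reply text (needs adk-serve running)."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    # Assumed OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]
```

With `adk-serve` listening on port 8080, `chat("hello")` should return the agent's reply as a string.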

Runtime backend switching — swap LLM providers mid-session, no restart:

adk backend set anthropic       # Switch primary to Claude
adk backend set-reasoning deepseek  # Route effort 7+ to DeepSeek
adk backend list                # Show all detected backends

No GPU? No problem. Set an API key and your agents use cloud inference. Have a GPU? They auto-detect it. Both? They route intelligently — local for fast tasks, cloud for deep reasoning.

adk quickstart --cloud    # Cloud-only mode (Anthropic, OpenAI, or DeepSeek)

3. AitherNode — Give Agents a Body

AitherNode is an MCP server that exposes your local hardware to agents. It turns your machine into a tool server:

Local image generation — ComfyUI integration (Flux, SDXL, Pony)
Local LLM inference — Ollama / vLLM integration
File system access — safe local file operations
Hardware stats — GPU/CPU monitoring

AitherNode runs as a daemon. Agents — whether local or cloud-based — connect to it via MCP to use your hardware. Think of it as the bridge between AI brains and your physical machine.

# AitherNode starts automatically with the full platform,
# or run standalone:
docker compose up aither-node

Any MCP-compatible client (Claude Desktop, Cursor, etc.) can connect to AitherNode and get access to 100+ tools — code search, file operations, image generation, memory, and more.

Full Platform (Docker)

For the complete experience — web UI, app builder, 43 agents, deployment — run the full stack:

git clone https://github.com/Aitherium/aither
cd aither
docker compose up -d

Open http://localhost:3000 and you get:

App Builder — build AI agent apps with 40+ panels, no code required
Private Chat — with voice, document analysis, knowledge base
Agent Fleet — 43 specialist agents that collaborate on tasks
Deployment — host apps for your team or clients via Cloudflare tunnels

The ADK and AitherShell work standalone or as part of the full platform. When Genesis (the orchestrator) is running, they gain access to the full context pipeline — memory, knowledge graph, CodeGraph indexing, multi-agent coordination.

How the Model Stack Works

AitherOS doesn't just "run a model." It runs a model stack — a coordinated set of models with priority-based VRAM management. Here's the default local-reasoning stack for an RTX 5090:

VRAM Budget (32GB):
  [P0] Nemotron-Orchestrator 8B AWQ   8GB   PINNED: never evicted
  [P1] Vision 7B                      6GB   lazy: loads on image input
  [P2] ComfyUI (Flux)                16GB   lazy: preempts reasoning
  [P3] DeepSeek-R1:14B AWQ            8GB   lazy: loads on effort >= 7
       Embeddings                    ~1GB   always loaded

Swap rules:

The orchestrator is pinned at priority 0 — it never gets evicted
DeepSeek-R1 loads on first effort >= 7 request, stays loaded (only 8GB)
ComfyUI preempts the reasoning model when you request image generation
While ComfyUI is active, reasoning requests route to cloud fallback
ComfyUI unloads after 5 minutes idle, reasoning model reloads automatically
Vision coexists with orchestrator + reasoning (22GB total, fits fine)
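The swap rules can be sketched as a small priority-based placement function. This is a simplified model, not MicroScheduler's actual code: the `Model` class and `load` function are hypothetical, embeddings are assumed to be pinned, and only the sizes and priorities come from the budget listed earlier.

```python
from dataclasses import dataclass

VRAM_BUDGET_GB = 32


@dataclass(frozen=True)
class Model:
    name: str
    size_gb: int
    priority: int          # lower number = more important
    pinned: bool = False


ORCHESTRATOR = Model("orchestrator", 8, 0, pinned=True)
EMBEDDINGS = Model("embeddings", 1, 0, pinned=True)   # assumed pinned
VISION = Model("vision", 6, 1)
COMFYUI = Model("comfyui", 16, 2)
REASONING = Model("reasoning", 8, 3)


def used_gb(models):
    return sum(m.size_gb for m in models)


def load(resident, incoming, budget=VRAM_BUDGET_GB):
    """Return (new_resident, evicted) after making room for `incoming`.

    Only unpinned models less important than `incoming` may be evicted,
    least important first. Raises MemoryError if it still does not fit,
    which is the point where a request would go to the cloud fallback.
    """
    if incoming in resident:
        return list(resident), []
    kept, evicted = list(resident), []
    candidates = sorted(
        (m for m in kept if not m.pinned and m.priority > incoming.priority),
        key=lambda m: m.priority,
        reverse=True,
    )
    for victim in candidates:
        if used_gb(kept) + incoming.size_gb <= budget:
            break
        kept.remove(victim)
        evicted.append(victim)
    if used_gb(kept) + incoming.size_gb > budget:
        raise MemoryError(f"cannot fit {incoming.name} in {budget}GB")
    return kept + [incoming], evicted


if __name__ == "__main__":
    resident, _ = load([ORCHESTRATOR, EMBEDDINGS], REASONING)
    resident, evicted = load(resident, COMFYUI)
    print([m.name for m in evicted])  # ['reasoning']
```

With orchestrator, embeddings, and reasoning resident (17GB), ComfyUI's 16GB would push the total to 33GB, so the reasoning model is the one evicted. Reloading reasoning while ComfyUI is active then fails, matching the cloud-fallback rule.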

All of this is managed by MicroScheduler (port 8150), which handles VRAM coordination, priority queuing, and preemption. Every LLM call in the system routes through it — you never call vLLM directly.

Effort-Based Routing

Requests are scored by effort (1-10), which determines which model handles them:

Effort	Tier	Model	Use Case
1-2	Fast	Nemotron-Orchestrator 8B	Quick answers, simple tasks
3-6	Balanced	Nemotron-Orchestrator 8B	Chat, coding, tool use
7+	Reasoning	DeepSeek-R1:14B	Complex analysis, multi-step reasoning

The effort level is assigned automatically based on the complexity of your request. You can also force it — aither --effort 8 "deep analysis" or AitherAgent("aither", effort=8).
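The routing table amounts to a simple threshold lookup. The sketch below is illustrative only; the `route` function is hypothetical, and how AitherOS actually scores a request's effort is not shown in this post.

```python
def route(effort: int) -> tuple[str, str]:
    """Map an effort score (1-10) to a (tier, model) pair from the table."""
    if not 1 <= effort <= 10:
        raise ValueError("effort must be between 1 and 10")
    if effort <= 2:
        return ("fast", "Nemotron-Orchestrator 8B")
    if effort <= 6:
        return ("balanced", "Nemotron-Orchestrator 8B")
    return ("reasoning", "DeepSeek-R1:14B")


if __name__ == "__main__":
    for effort in (1, 5, 8):
        print(effort, route(effort))
```

Note that the fast and balanced tiers share the same model, so only the 7+ threshold changes which model is loaded.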

GPU Tiers — Not Just for 5090s

The ADK auto-detects your hardware and picks the right model configuration:

adk setup --tier nano          # 6GB VRAM — TQ4 quantized, single model
adk setup --tier standard-tq4  # 12-16GB — both models TQ4
adk setup --tier full          # 24GB+ — orchestrator + reasoning + embeddings
adk quickstart                 # Auto-detect — picks the best tier for your GPU

The 5090 gets the full tier by default, but the entire toolchain works on cards as small as a 3060 (12GB) — just with smaller models and no local reasoning.
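Tier auto-detection presumably boils down to a VRAM threshold check. The sketch below is an assumption about how that could look, not the ADK's implementation: the tier names and the 6GB, 12GB, and 24GB thresholds come from the commands above, while `pick_tier`, the handling of cards between the listed ranges, and the "cloud" result for very small cards are guesses.

```python
def pick_tier(vram_gb: float) -> str:
    """Pick a tier name from available VRAM in GB."""
    if vram_gb >= 24:
        return "full"           # orchestrator + reasoning + embeddings
    if vram_gb >= 12:
        return "standard-tq4"   # both models, TQ4 quantized
    if vram_gb >= 6:
        return "nano"           # single TQ4 model
    return "cloud"              # assumed: too little VRAM, use cloud backends


if __name__ == "__main__":
    for name, vram in (("RTX 5090", 32), ("RTX 3060", 12), ("6GB card", 6)):
        print(name, "->", pick_tier(vram))
```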

Stack Profiles

The system ships with multiple stack profiles you can switch at runtime:

# See available stacks
curl localhost:8001/model-stacks

# Switch stacks (hot — no restart needed)
curl -X POST localhost:8001/model-stacks/switch \
  -H "Content-Type: application/json" \
  -d '{"stack": "local-gemma4"}'

Stack	Models	VRAM	Best For
`local-reasoning`	Nemotron-Orchestrator + DeepSeek-R1 + Vision	~17GB	General use (default)
`local-gemma4`	Nemotron-Orchestrator + Gemma 4 E4B	~14GB	Multimodal reasoning (images + CoT in same context)

The local-gemma4 stack is interesting: Gemma 4 has native vision, so it replaces both the reasoning model (~8GB) and the vision model (~6GB) with one 5GB model, freeing about 9GB of VRAM for longer conversations or ComfyUI.
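One way to read the VRAM column is as the always-resident footprint of each stack, with on-demand models added on top. The sketch below works through that arithmetic using the per-model sizes quoted in this post. The split into "resident" and "on_demand" is an assumption made to match the table's numbers, and the dictionary layout and function names are illustrative.

```python
GPU_VRAM_GB = 32
COMFYUI_GB = 16

# Approximate per-model sizes quoted in this post.
MODEL_VRAM_GB = {
    "orchestrator": 8,
    "reasoning": 8,
    "vision": 6,
    "embeddings": 1,
    "gemma4": 5,
}

# Assumed split between always-resident and on-demand models.
STACKS = {
    "local-reasoning": {
        "resident": ["orchestrator", "reasoning", "embeddings"],
        "on_demand": ["vision"],
    },
    "local-gemma4": {
        "resident": ["orchestrator", "gemma4", "embeddings"],
        "on_demand": [],
    },
}


def stack_vram(name: str) -> tuple[int, int]:
    """Return (resident_gb, peak_gb) for a stack."""
    stack = STACKS[name]
    resident = sum(MODEL_VRAM_GB[m] for m in stack["resident"])
    peak = resident + sum(MODEL_VRAM_GB[m] for m in stack["on_demand"])
    return resident, peak


def comfyui_fits(name: str) -> bool:
    """True if ComfyUI fits beside the stack's peak footprint without eviction."""
    _, peak = stack_vram(name)
    return peak + COMFYUI_GB <= GPU_VRAM_GB


if __name__ == "__main__":
    for name in STACKS:
        print(name, stack_vram(name), comfyui_fits(name))
```

Under these assumptions, local-gemma4 leaves enough headroom on a 32GB card for ComfyUI to load without preempting anything, while local-reasoning does not.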

Why Not Just Use LM Studio?

LM Studio is excellent for one thing: downloading and chatting with models. If that's all you need, use it.

But LM Studio gives you a chat window. Aitherium gives you a toolchain:

Feature	LM Studio	Aitherium
Chat with models	Yes	Yes (AitherShell, web UI, or API)
Build agent systems	No	Yes (ADK — Python SDK, fleet mode)
MCP tool server	No	Yes (AitherNode — 100+ tools)
VRAM-managed multi-model	No (one model)	Yes (priority-based eviction)
Effort-based model routing	No	Yes (auto-selects model by complexity)
Runtime backend switching	No	Yes (swap providers mid-session)
Voice input/output	No	Yes (local TTS/STT)
Document analysis / RAG	Partial (chat with attached documents)	Yes (drag-and-drop, embeddings)
Build apps	No	Yes (40+ panels, visual builder)
Deploy for others	No	Yes (Docker + Cloudflare tunnels)
Multi-agent orchestration	No	Yes (43 agents, swarm coding)
Image generation	No	Yes (ComfyUI, auto VRAM swap)
CLI shell	Partial (lms CLI for managing models and the local server)	Yes (AitherShell: scriptable, pipeable)
Plugin system	No	Yes (YAML/Python plugins)
Works without GPU	Yes (CPU inference, slower)	Yes (cloud backends, API keys)

We wrote a detailed comparison with Pinokio if you want the full breakdown.

Scaling Beyond One GPU

Have a second machine? AitherOS supports offloading reasoning to a separate GPU over your local network. For example, an NVIDIA DGX Spark with 128GB unified memory can run a 27B reasoning model while your 5090 handles the orchestrator and image generation.

The system handles this transparently — reasoning requests route to the remote model via proxy, and everything looks like one unified stack to the user.

Privacy Guarantees

When running locally:

Zero data leaves your machine (not even telemetry — opt-in only)
No internet required after initial model download
No API keys, no accounts, no cloud dependencies
Works in air-gapped environments
Cloud fallback is opt-in and only activates during VRAM contention

This matters for: lawyers, doctors, financial advisors, journalists, anyone handling sensitive data.

Making Money With Your GPU

A private AI platform isn't just for personal use:

Selling AI services — build agent apps with the ADK, deploy with adk-serve, charge per seat
Consulting — set up private AI for businesses that can't use cloud (legal, medical, finance)
Content creation — image generation, writing, voice — all with zero API costs
Development — code faster with agents that know your entire codebase (CodeGraph indexes everything)
API provider — run adk-serve and sell OpenAI-compatible API access on your own hardware

The ROI math: assuming your time is worth about $50/hour, saving (or billing) 2 hours/day over 22 working days comes to $2,200, so a $2,000 GPU pays for itself in about a month.

Getting Started Today

Fastest path (2 minutes):

npm install -g @aitherium/aithershell
aither "hello"

Developer path (5 minutes):

pip install aither-adk
adk quickstart
adk start

Full platform (10 minutes):

git clone https://github.com/Aitherium/aither
cd aither
docker compose up -d
# Open localhost:3000

Your RTX 5090 is a private AI platform. Stop paying $20/month for ChatGPT when you have better hardware sitting under your desk.

Questions? Join the community or email hello@aitherium.com.
