Three Tiers, Nine Models, Zero Cloud Bills
Most AI infrastructure looks the same: a beefy GPU box, a vLLM instance, an nginx reverse proxy, and a prayer that your one model handles everything from intent classification to code generation to security analysis. When it doesn't — and it never does — you bolt on prompt engineering, chain-of-thought wrappers, and increasingly desperate system prompts.
We went a different direction. AitherOS runs a fleet of nine purpose-trained models across a three-tier compute topology, where a 0.5B intent classifier makes the routing decision before the 70B reasoning model even wakes up. The big model is a tool. The small models are the nervous system.
This is how it works.
The topology
Three physical tiers. Each exists because the others can't do its job.
Tier 1 — Desktop (RTX 5090, 32 GB VRAM, 128 GB DDR5)
The always-on orchestrator. An 8B parameter model quantized to ~8 GB runs every conversation, every tool call, every agent dispatch. It never sleeps, never evicts. Priority zero.
Alongside it, a 14B DeepSeek-R1 sits in a swap slot for local reasoning — cheap enough to keep warm at 8 GB, evictable when ComfyUI needs VRAM for image generation. When ComfyUI preempts reasoning, reasoning routes to DGX automatically. No user-visible latency spike.
This tier handles effort levels 1-6: the bulk of all interactions. Fast, responsive, always available.
Tier 2 — DGX Spark (GB10 Grace Blackwell, 128 GB unified memory)
The deep thinker. DeepSeek-R1 70B in BF16 occupies ~40 GB of the 128 GB unified coherent memory pool. The orchestrator invokes it as a tool — same dispatch pattern as any other tool call. Not a routing tier, not an auto-escalation. A deliberate invocation when the 8B determines the problem requires it.
Alongside the 70B sits the fleet: four specialist models totaling ~2.5 GB, plus up to 32 LoRA personality adapters at negligible cost. After the fleet, KV cache, and overhead, roughly 65 GB of headroom remains for a lazy-loaded 80B coding model (Qwen3-Coder) that handles heavy implementation tasks.
The DGX runs vLLM with --enable-lora --max-loras 32 --max-lora-rank 64. One inference server, one port, many models.
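The dispatch itself is ordinary tool plumbing. A minimal sketch, assuming the DGX exposes vLLM's OpenAI-compatible API; the hostname, port, and served-model name below are placeholders, not the production values:

import requests

DGX_CHAT_URL = "http://dgx.local:8000/v1/chat/completions"  # assumed endpoint

def deep_reason(problem: str, max_tokens: int = 2048) -> str:
    """Invoked by the 8B orchestrator exactly like any other tool."""
    payload = {
        "model": "deepseek-r1-70b",  # assumed served-model name
        "messages": [{"role": "user", "content": problem}],
        "max_tokens": max_tokens,
    }
    resp = requests.post(DGX_CHAT_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]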
Tier 3 — Rocky Linux bare-metal nodes (Podman + systemd)
The field deployment. Any machine with SSH access and an internet connection becomes an AitherOS node in one command:
curl -fsSL https://get.aitherium.com/install.sh | bash
Or from the desktop controller:
Invoke-AitherPlaybook deploy-rocky-linux -Variables @{
    TargetHost = "10.0.1.50"
    Mode = "pull"
    Profile = "standard"
}
The playbook installs Podman, configures SELinux, pulls container images from GHCR, generates 113 systemd unit files, opens firewall ports, provisions Ollama models, and runs smoke tests. The node joins the mesh via mDNS discovery and PSK challenge-response. Total time from bare Rocky 9 install to running AitherOS: under 15 minutes.
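The challenge-response is the interesting part of the join. A minimal sketch of an HMAC-based PSK handshake; the actual AitherOS message framing may differ, and the key value is a placeholder:

import hmac
import hashlib
import os

PSK = b"provisioned-out-of-band"  # placeholder pre-shared key

def make_challenge() -> bytes:
    """Controller: issue a random nonce to the node discovered via mDNS."""
    return os.urandom(32)

def answer_challenge(nonce: bytes) -> bytes:
    """Node: prove knowledge of the PSK without ever sending it."""
    return hmac.new(PSK, nonce, hashlib.sha256).digest()

def verify(nonce: bytes, response: bytes) -> bool:
    """Controller: constant-time comparison against the expected MAC."""
    expected = hmac.new(PSK, nonce, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)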
For air-gapped deployments, build-iso.sh produces a bootable USB image with an embedded kickstart configuration. Boot it, walk away, come back to a running node.
The fleet
The specialist models exist because a 70B generalist is absurdly wasteful for decisions that a 0.5B model can make in 2ms.
Intent classifier (0.5B, 512 MB, persistent)
Classifies incoming queries into tool categories and intent types. Trained on production conversation data via Unsloth LoRA fine-tuning. Before this existed, every query ran through 28 intent patterns. Now the fleet model classifies in a single forward pass, and the patterns are a fallback.
The training loop is closed: ChatEngine logs every classified intent to Strata. Every 24 hours, the fleet training scheduler checks if 50+ new samples have accumulated. If so, it trains a new version, deploys it as a 10% canary, and auto-promotes after 24 hours if accuracy holds.
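A sketch of that daily cycle. The helper names (fetch_new_samples, train_lora, deploy_canary, and friends) are illustrative stand-ins for the real pipeline stages, not the actual API:

MIN_NEW_SAMPLES = 50
CANARY_FRACTION = 0.10
CANARY_WINDOW_HOURS = 24

def nightly_cycle(model_name: str) -> None:
    samples = fetch_new_samples(model_name)           # classified intents logged to Strata
    if len(samples) < MIN_NEW_SAMPLES:
        return                                        # not enough new signal yet

    version = train_lora(model_name, samples)         # Unsloth LoRA fine-tune
    deploy_canary(version, fraction=CANARY_FRACTION)  # 10% of traffic

    if canary_accuracy(version, hours=CANARY_WINDOW_HOURS) >= baseline_accuracy(model_name):
        promote(version)                              # becomes the new default
    else:
        rollback(version)                             # keep the previous weights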
Tool selector (0.5B, 512 MB, persistent)
AitherOS has 100+ MCP tools. Passing all their schemas to the LLM costs 12,000 tokens per turn. The tool selector predicts the 8 most relevant tools for each task, cutting prompt size by 73%.
This is Tier 0c in a four-tier selection cascade:
- Fleet model (0.5B, ~2ms) — if available and confident
- NanoGPT char-level predictor (<1ms) — per-tool trained
- Hybrid keyword + semantic search (~10ms) — always works
- Full schema dump (all tools) — last resort
Each tier falls through to the next on failure. The system works on day zero with no trained models (tier 3-4). It gets faster as training data accumulates (tier 1-2).
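In code, the cascade is just an ordered fall-through. A sketch, with the per-tier selector functions left as placeholders:

from typing import Callable, Optional

Selector = Callable[[str], Optional[list[str]]]

def select_tools(query: str, tiers: list[Selector], all_tools: list[str]) -> list[str]:
    for tier in tiers:
        try:
            shortlist = tier(query)
        except Exception:
            shortlist = None          # a failed or unavailable tier never blocks the request
        if shortlist:                 # confident, non-empty prediction wins
            return shortlist
    return all_tools                  # last resort: full schema dump

# Ordered fastest-but-optional first (placeholder names):
# tools = select_tools(query, [fleet_model, nanogpt_predictor, hybrid_search], ALL_TOOLS)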
MCTS evaluator (1.5B, 1 GB, lazy-loaded)
The planning system uses Monte Carlo Tree Search to evaluate task decompositions. The evaluator scores plan rollouts — does this sequence of steps actually address the objective? Is the effort realistic? Are dependencies ordered correctly?
Before the fleet model, scoring was pure heuristic: keyword coverage, feasibility bounds, structural checks. Now it blends 50% fleet model judgment with 50% heuristic. The heuristic keeps the model honest; the model catches things heuristics miss — like recognizing that "add OAuth2" requires a database migration that the heuristic has no concept of.
Lazy-loaded with a 10-minute idle timeout. Planning happens in bursts, not continuously.
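The blend itself is one line. A sketch, assuming both scores are normalized to [0, 1]:

from typing import Optional

FLEET_WEIGHT = 0.5

def score_rollout(fleet_score: Optional[float], heuristic_score: float) -> float:
    """Fall back to the pure heuristic when the 1.5B evaluator isn't loaded
    (it idles out after 10 minutes)."""
    if fleet_score is None:
        return heuristic_score
    return FLEET_WEIGHT * fleet_score + (1 - FLEET_WEIGHT) * heuristic_score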
Neuron predictor (0.5B, 512 MB, persistent)
AitherOS fires "neurons" — background data-gathering tasks that run in parallel during context assembly. Resource usage stats, git history, calendar context, web search results. There are 20+ neuron types. Firing all of them wastes tokens and adds latency. The predictor learns which neurons are actually consumed by the LLM for each query type.
"What time is my next meeting?" fires the calendar neuron. "Refactor the auth middleware" fires code graph + git history. The predictor learns these associations from production usage.
Agent LoRA adapters (5 agents, ~0 MB each, on-demand)
Five key agents have personality LoRA adapters trained on their conversation history:
| Agent | Specialty | Training data |
|---|---|---|
| Demiurge | Code implementation | coding, instruction, code review |
| Hydra | Code review | code review, analysis |
| Athena | Security audit | security, analysis |
| Scribe | Documentation | documentation, instruction |
| Viviane | Memory/knowledge | memory, knowledge, conversation |
These are LoRA adapters on the already-loaded 70B base. vLLM's multi-LoRA support means loading an adapter is a pointer swap, not a model load. The identity YAML declares the adapter:
# config/identities/demiurge.yaml
model_config:
  lora_adapter: demiurge-persona-v1
  training_data_types: [coding, instruction, code_review]
  min_training_samples: 200
AgentForge reads this at dispatch time and passes the adapter name through to the LLM gateway. The 70B model responds differently as Demiurge (terse, code-focused, ships PRs) versus Athena (thorough, security-focused, flags risks) versus Viviane (contemplative, memory-focused, surfaces forgotten context).
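The pass-through is a few lines. A sketch assuming PyYAML; the gateway request is the same OpenAI-compatible call shown earlier, with the adapter name substituted for the base model:

import yaml

def adapter_for(identity_path: str) -> str:
    with open(identity_path) as f:
        cfg = yaml.safe_load(f)
    return cfg["model_config"]["lora_adapter"]   # e.g. demiurge-persona-v1

# Dispatch (illustrative):
#   payload["model"] = adapter_for("config/identities/demiurge.yaml")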
The training pipeline harvests per-agent conversations from Strata, filters through the safety pipeline, trains via Unsloth, evaluates against a held-out set, and registers the new version. 200 samples minimum per agent.
The VRAM budget
Numbers matter. Here's the actual allocation:
| Component | VRAM | Location | Mode |
|---|---|---|---|
| 8B orchestrator | 8 GB | Desktop | Always loaded |
| 14B DeepSeek-R1 | 8 GB | Desktop | Swap slot (evicts for ComfyUI) |
| ComfyUI (FLUX) | 16 GB | Desktop | On-demand (preempts reasoning) |
| Embeddings | 1 GB | Desktop | Always loaded |
| Desktop total (ComfyUI idle) | 17 GB | of 32 GB | |
| 70B DeepSeek-R1 | ~40 GB | DGX | Always loaded |
| Intent classifier | 512 MB | DGX | Persistent |
| Tool selector | 512 MB | DGX | Persistent |
| MCTS evaluator | 1 GB | DGX | Lazy (10min timeout) |
| Neuron predictor | 512 MB | DGX | Persistent |
| Agent LoRAs (x5) | ~0 | DGX | On-demand adapter swap |
| KV cache + overhead | ~20 GB | DGX | Dynamic |
| DGX total | ~63 GB | of 128 GB | |
| Free remaining | ~65 GB | DGX | Available for 80B coder, more fleet models |
The 80B Qwen3-Coder is a lazy-loaded tool for heavy implementation tasks. It shares the DGX with everything else because 128 GB unified memory is generous.
The deployment pipeline
Deployment isn't an afterthought — it's the same automation framework that manages everything else.
One-click remote deploy (from desktop to any Rocky 9 box):
POST /deploy/podman-node
{
  "host": "10.0.1.50",
  "mode": "pull",
  "profile": "standard",
  "gpu": true
}
Genesis calls the AitherZero playbook, which SSHs into the target and runs a 9-phase deployment:
- Validate SSH + OS
- Install packages (Podman, Python 3.12, Ollama)
- Create user + directories + Ed25519 keypair
- Pull images from GHCR (or build from source)
- Generate + install systemd units
- Configure firewall
- Provision AI models
- Start services
- Smoke test
Bootable USB (for air-gapped or new hardware):
sudo bash deploy/rocky-linux/build-iso.sh \
--from-ghcr \
--embed-source \
--kickstart
Produces AitherOS-0.1.0-x86_64.iso. Boot it and the kickstart handles partitioning, package installation, user creation, and SELinux policy, then leaves a first-boot prompt to run the bootstrap script. The full stack comes up as systemd services under Podman.
Node management (on the deployed node):
aitheros-ctl status # Service status
aitheros-ctl health # Health checks
aitheros-ctl logs genesis # Follow logs
aitheros-ctl update # Pull + restart
aitheros-ctl backup # Backup all data
Cockpit on port 9090 provides a web console with a Podman plugin for container management.
The fallback chain
Every component has a graceful degradation path. This is non-negotiable.
Fleet model (0.5B on DGX)
↓ DGX unreachable
NanoGPT char-level (KB-sized, in-process)
↓ Model untrained
Regex/keyword heuristic
↓ No patterns match
Default behavior (all tools, no filtering)
Nothing crashes. Nothing blocks. The system works on day zero with no trained models, no DGX, no fleet. It just works better with them. Each layer is additive, not required.
The same pattern applies to the compute topology:
DGX 70B reasoning (best quality)
↓ DGX unreachable
Desktop 14B DeepSeek-R1 (good quality, local)
↓ Evicted by ComfyUI
Cloud reasoning via Vast.ai (pay-per-use)
↓ No cloud configured
Desktop 8B orchestrator (handles everything, lower quality)
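Backend selection is the same fall-through pattern in miniature. A sketch, with the availability probes (dgx_reachable and friends) standing in for the real health checks:

BACKENDS = [
    ("dgx-70b",      lambda: dgx_reachable()),           # best quality
    ("desktop-14b",  lambda: not comfyui_holds_vram()),  # good quality, local
    ("vastai-cloud", lambda: cloud_configured()),        # pay-per-use
    ("desktop-8b",   lambda: True),                      # always available
]

def pick_reasoning_backend() -> str:
    for name, available in BACKENDS:
        if available():
            return name
    return "desktop-8b"  # unreachable given the final entry, kept as a guard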
What this enables
With the fleet running, the system makes thousands of millisecond-scale routing decisions per conversation that a monolithic model would otherwise need full inference passes for. The 70B model only activates when the 0.5B classifier determines the task genuinely requires deep reasoning.
In practice, this means:
- 80% of queries never touch the 70B. Intent classification, tool selection, neuron prediction, and effort routing all happen at the fleet tier.
- Agent personalities are real, not prompted. LoRA adapters trained on conversation history produce measurably different response patterns — not "act like a security expert" in the system prompt, but actual weight-level specialization.
- New nodes deploy in minutes, not days. SSH + internet + one command. The playbook is idempotent. Run it twice, nothing breaks.
- The system improves itself. Every classified intent, every tool selection, every neuron fire becomes training data. The fleet scheduler trains new versions daily, deploys as canary, auto-promotes if metrics hold.
The cloud bill is zero. The small models cost nothing to run alongside the 70B — they're rounding errors in the VRAM budget. The infrastructure pays for itself in the first month by eliminating API calls to hosted models.
That's the topology. Desktop for orchestration, DGX for depth, bare-metal nodes for scale. Nine models, three tiers, zero cloud dependency. The big model is a tool. The small models are the brain.