Three Tiers, Nine Models, Zero Cloud Bills
Most AI infrastructure looks the same: a beefy GPU box, a vLLM instance, an nginx reverse proxy, and a prayer that your one model handles everything from intent classification to code generation to security analysis. When it doesn't — and it never does — you bolt on prompt engineering, chain-of-thought wrappers, and increasingly desperate system prompts.
We went a different direction. AitherOS runs a fleet of nine purpose-trained models across a three-tier compute topology, where a 0.5B intent classifier makes the routing decision before the 70B reasoning model even wakes up. The big model is a tool. The small models are the nervous system.
This is how it works.
The topology
Three physical tiers. Each exists because the others can't do its job.
Tier 1 — Desktop (RTX 5090, 32 GB VRAM, 128 GB DDR5)
The always-on orchestrator. An 8B parameter model quantized to ~8 GB runs every conversation, every tool call, every agent dispatch. It never sleeps, never evicts. Priority zero.
Alongside it, a 14B DeepSeek-R1 sits in a swap slot for local reasoning — cheap enough to keep warm at 8 GB, evictable when ComfyUI needs VRAM for image generation. When ComfyUI preempts reasoning, reasoning routes to DGX automatically. No user-visible latency spike.
This tier handles effort levels 1-6: the bulk of all interactions. Fast, responsive, always available.
Tier 2 — DGX Spark (GB10 Grace Blackwell, 128 GB unified memory)
The deep thinker. DeepSeek-R1 70B in BF16 occupies ~40 GB of the 128 GB unified coherent memory pool. The orchestrator invokes it as a tool — same dispatch pattern as any other tool call. Not a routing tier, not an auto-escalation. A deliberate invocation when the 8B determines the problem requires it.
Alongside the 70B sits the fleet: four specialist models totaling ~2.5 GB, plus up to 32 LoRA personality adapters at negligible cost. After the fleet, KV cache, and overhead, roughly 65 GB of headroom remains for a lazy-loaded 80B coding model (Qwen3-Coder) that handles heavy implementation tasks.
The DGX runs vLLM with --enable-lora --max-loras 32 --max-lora-rank 64. One inference server, one port, many models.
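The dispatch itself is ordinary tool plumbing. A minimal sketch, assuming the DGX exposes vLLM's OpenAI-compatible API; the hostname, port, and served-model name below are placeholders, not the production values:

import requests

DGX_CHAT_URL = "http://dgx.local:8000/v1/chat/completions"  # assumed endpoint

def deep_reason(problem: str, max_tokens: int = 2048) -> str:
    """Invoked by the 8B orchestrator exactly like any other tool."""
    payload = {
        "model": "deepseek-r1-70b",  # assumed served-model name
        "messages": [{"role": "user", "content": problem}],
        "max_tokens": max_tokens,
    }
    resp = requests.post(DGX_CHAT_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]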
Tier 3 — Rocky Linux bare-metal nodes (Podman + systemd)
The field deployment. Any machine with SSH access and an internet connection becomes an AitherOS node in one command:
curl -fsSL https://get.aitherium.com/install.sh | bash
Or from the desktop controller:
Invoke-AitherPlaybook deploy-rocky-linux -Variables @{
    TargetHost = "10.0.1.50"
    Mode = "pull"
    Profile = "standard"
}
The playbook installs Podman, configures SELinux, pulls container images from GHCR, generates 113 systemd unit files, opens firewall ports, provisions Ollama models, and runs smoke tests. The node joins the mesh via mDNS discovery and PSK challenge-response. Total time from bare Rocky 9 install to running AitherOS: under 15 minutes.
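The challenge-response is the interesting part of the join. A minimal sketch of an HMAC-based PSK handshake; the actual AitherOS message framing may differ, and the key value is a placeholder:

import hmac
import hashlib
import os

PSK = b"provisioned-out-of-band"  # placeholder pre-shared key

def make_challenge() -> bytes:
    """Controller: issue a random nonce to the node discovered via mDNS."""
    return os.urandom(32)

def answer_challenge(nonce: bytes) -> bytes:
    """Node: prove knowledge of the PSK without ever sending it."""
    return hmac.new(PSK, nonce, hashlib.sha256).digest()

def verify(nonce: bytes, response: bytes) -> bool:
    """Controller: constant-time comparison against the expected MAC."""
    expected = hmac.new(PSK, nonce, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)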
For air-gapped deployments, build-iso.sh produces a bootable USB image with an embedded kickstart configuration. Boot it, walk away, come back to a running node.
The fleet
The specialist models exist because a 70B generalist is absurdly wasteful for decisions that a 0.5B model can make in 2ms.
Intent classifier (0.5B, 512 MB, persistent)
Classifies incoming queries into tool categories and intent types. Trained on production conversation data via Unsloth LoRA fine-tuning. Before this existed, every query ran through 28 intent patterns. Now the fleet model classifies in a single forward pass, and the patterns are a fallback.
The training loop is closed: ChatEngine logs every classified intent to Strata. Every 24 hours, the fleet training scheduler checks if 50+ new samples have accumulated. If so, it trains a new version, deploys it as a 10% canary, and auto-promotes after 24 hours if accuracy holds.
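A sketch of that daily cycle. The helper names (fetch_new_samples, train_lora, deploy_canary, and friends) are illustrative stand-ins for the real pipeline stages, not the actual API:

MIN_NEW_SAMPLES = 50
CANARY_FRACTION = 0.10
CANARY_WINDOW_HOURS = 24

def nightly_cycle(model_name: str) -> None:
    samples = fetch_new_samples(model_name)           # classified intents logged to Strata
    if len(samples) < MIN_NEW_SAMPLES:
        return                                        # not enough new signal yet

    version = train_lora(model_name, samples)         # Unsloth LoRA fine-tune
    deploy_canary(version, fraction=CANARY_FRACTION)  # 10% of traffic

    if canary_accuracy(version, hours=CANARY_WINDOW_HOURS) >= baseline_accuracy(model_name):
        promote(version)                              # becomes the new default
    else:
        rollback(version)                             # keep the previous weights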
Tool selector (0.5B, 512 MB, persistent)
AitherOS has 100+ MCP tools. Passing all their schemas to the LLM costs 12,000 tokens per turn. The tool selector predicts the 8 most relevant tools for each task, cutting prompt size by 73%.
This is Tier 0c in a four-tier selection cascade:
- Fleet model (0.5B, ~2ms) — if available and confident
- NanoGPT char-level predictor (<1ms) — per-tool trained
- Hybrid keyword + semantic search (~10ms) — always works
- Full schema dump (all tools) — last resort
Each tier falls through to the next on failure. The system works on day zero with no trained models (tier 3-4). It gets faster as training data accumulates (tier 1-2).
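In code, the cascade is just an ordered fall-through. A sketch, with the per-tier selector functions left as placeholders:

from typing import Callable, Optional

Selector = Callable[[str], Optional[list[str]]]

def select_tools(query: str, tiers: list[Selector], all_tools: list[str]) -> list[str]:
    for tier in tiers:
        try:
            shortlist = tier(query)
        except Exception:
            shortlist = None          # a failed or unavailable tier never blocks the request
        if shortlist:                 # confident, non-empty prediction wins
            return shortlist
    return all_tools                  # last resort: full schema dump

# Ordered fastest-but-optional first (placeholder names):
# tools = select_tools(query, [fleet_model, nanogpt_predictor, hybrid_search], ALL_TOOLS)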
MCTS evaluator (1.5B, 1 GB, lazy-loaded)
The planning system uses Monte Carlo Tree Search to evaluate task decompositions. The evaluator scores plan rollouts — does this sequence of steps actually address the objective? Is the effort realistic? Are dependencies ordered correctly?
Before the fleet model, scoring was pure heuristic: keyword coverage, feasibility bounds, structural checks. Now it blends 50% fleet model judgment with 50% heuristic. The heuristic keeps the model honest; the model catches things heuristics miss — like recognizing that "add OAuth2" requires a database migration that the heuristic has no concept of.
Lazy-loaded with a 10-minute idle timeout. Planning happens in bursts, not continuously.
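The blend itself is one line. A sketch, assuming both scores are normalized to [0, 1]:

from typing import Optional

FLEET_WEIGHT = 0.5

def score_rollout(fleet_score: Optional[float], heuristic_score: float) -> float:
    """Fall back to the pure heuristic when the 1.5B evaluator isn't loaded
    (it idles out after 10 minutes)."""
    if fleet_score is None:
        return heuristic_score
    return FLEET_WEIGHT * fleet_score + (1 - FLEET_WEIGHT) * heuristic_score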
Neuron predictor (0.5B, 512 MB, persistent)
AitherOS fires "neurons" — background data-gathering tasks that run in parallel during context assembly. Resource usage stats, git history, calendar context, web search results. There are 20+ neuron types. Firing all of them wastes tokens and adds latency. The predictor learns which neurons are actually consumed by the LLM for each query type.
"What time is my next meeting?" fires the calendar neuron. "Refactor the auth middleware" fires code graph + git history. The predictor learns these associations from production usage.
Agent LoRA adapters (5 agents, ~0 MB each, on-demand)
Five key agents have personality LoRA adapters trained on their conversation history:
| Agent | Specialty | Training data |
|---|---|---|
| Demiurge | Code implementation | coding, instruction, code review |
| Hydra | Code review | code review, analysis |
| Athena | Security audit | security, analysis |
| Scribe | Documentation | documentation, instruction |
| Viviane | Memory/knowledge | memory, knowledge, conversation |
These are LoRA adapters on the already-loaded 70B base. vLLM's multi-LoRA support means loading an adapter is a pointer swap, not a model load. The identity YAML declares the adapter:
# config/identities/demiurge.yaml
model_config:
  lora_adapter: demiurge-persona-v1
  training_data_types: [coding, instruction, code_review]
  min_training_samples: 200
AgentForge reads this at dispatch time and passes the adapter name through to the LLM gateway. The 70B model responds differently as Demiurge (terse, code-focused, ships PRs) versus Athena (thorough, security-focused, flags risks) versus Viviane (contemplative, memory-focused, surfaces forgotten context).
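The pass-through is a few lines. A sketch assuming PyYAML; the gateway request is the same OpenAI-compatible call shown earlier, with the adapter name substituted for the base model:

import yaml

def adapter_for(identity_path: str) -> str:
    with open(identity_path) as f:
        cfg = yaml.safe_load(f)
    return cfg["model_config"]["lora_adapter"]   # e.g. demiurge-persona-v1

# Dispatch (illustrative):
#   payload["model"] = adapter_for("config/identities/demiurge.yaml")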
The training pipeline harvests per-agent conversations from Strata, filters through the safety pipeline, trains via Unsloth, evaluates against a held-out set, and registers the new version. 200 samples minimum per agent.
The VRAM budget
Numbers matter. Here's the actual allocation:
| Component | VRAM | Location | Mode |
|---|---|---|---|
| 8B orchestrator | 8 GB | Desktop | Always loaded |
| 14B DeepSeek-R1 | 8 GB | Desktop | Swap slot (evicts for ComfyUI) |
| ComfyUI (FLUX) | 16 GB | Desktop | On-demand (preempts reasoning) |
| Embeddings | 1 GB | Desktop | Always loaded |
| Desktop total (ComfyUI idle) | 17 GB | of 32 GB | |
| 70B DeepSeek-R1 | ~40 GB | DGX | Always loaded |
| Intent classifier | 512 MB | DGX | Persistent |
| Tool selector | 512 MB | DGX | Persistent |
| MCTS evaluator | 1 GB | DGX | Lazy (10min timeout) |
| Neuron predictor | 512 MB | DGX | Persistent |
| Agent LoRAs (x5) | ~0 | DGX | On-demand adapter swap |
| KV cache + overhead | ~20 GB | DGX | Dynamic |
| DGX total | ~63 GB | of 128 GB | |
| Free remaining | ~65 GB | DGX | Available for 80B coder, more fleet models |
The 80B Qwen3-Coder is a lazy-loaded tool for heavy implementation tasks. It shares the DGX with everything else because 128 GB unified memory is generous.
The deployment pipeline
Deployment isn't an afterthought — it's the same automation framework that manages everything else.
One-click remote deploy (from desktop to any Rocky 9 box):
POST /deploy/podman-node
{
  "host": "10.0.1.50",
  "mode": "pull",
  "profile": "standard",
  "gpu": true
}
Genesis calls the AitherZero playbook, which SSHs into the target and runs a 9-phase deployment:
- Validate SSH + OS
- Install packages (Podman, Python 3.12, Ollama)
- Create user + directories + Ed25519 keypair
- Pull images from GHCR (or build from source)
- Generate + install systemd units
- Configure firewall
- Provision AI models
- Start services
- Smoke test
Bootable USB (for air-gapped or new hardware):
sudo bash deploy/rocky-linux/build-iso.sh \
--from-ghcr \
--embed-source \
--kickstart
Produces AitherOS-0.1.0-x86_64.iso. Boot it and the kickstart handles partitioning, package installation, user creation, and SELinux policy, then leaves a first-boot prompt to run the bootstrap script. The full stack comes up as systemd services under Podman.
Node management (on the deployed node):
aitheros-ctl status # Service status
aitheros-ctl health # Health checks
aitheros-ctl logs genesis # Follow logs
aitheros-ctl update # Pull + restart
aitheros-ctl backup # Backup all data
Cockpit on port 9090 provides a web console with a Podman plugin for container management.
The fallback chain
Every component has a graceful degradation path. This is non-negotiable.
Fleet model (0.5B on DGX)
↓ DGX unreachable
NanoGPT char-level (KB-sized, in-process)
↓ Model untrained
Regex/keyword heuristic
↓ No patterns match
Default behavior (all tools, no filtering)
Nothing crashes. Nothing blocks. The system works on day zero with no trained models, no DGX, no fleet. It just works better with them. Each layer is additive, not required.
The same pattern applies to the compute topology:
DGX 70B reasoning (best quality)
↓ DGX unreachable
Desktop 14B DeepSeek-R1 (good quality, local)
↓ Evicted by ComfyUI
Cloud reasoning via Vast.ai (pay-per-use)
↓ No cloud configured
Desktop 8B orchestrator (handles everything, lower quality)
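Backend selection is the same fall-through pattern in miniature. A sketch, with the availability probes (dgx_reachable and friends) standing in for the real health checks:

BACKENDS = [
    ("dgx-70b",      lambda: dgx_reachable()),           # best quality
    ("desktop-14b",  lambda: not comfyui_holds_vram()),  # good quality, local
    ("vastai-cloud", lambda: cloud_configured()),        # pay-per-use
    ("desktop-8b",   lambda: True),                      # always available
]

def pick_reasoning_backend() -> str:
    for name, available in BACKENDS:
        if available():
            return name
    return "desktop-8b"  # unreachable given the final entry, kept as a guard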
What this enables
With the fleet running, the system makes thousands of millisecond-scale routing decisions per conversation that a monolithic model would otherwise need full inference passes for. The 70B model only activates when the 0.5B classifier determines the task genuinely requires deep reasoning.
In practice, this means:
- 80% of queries never touch the 70B. Intent classification, tool selection, neuron prediction, and effort routing all happen at the fleet tier.
- Agent personalities are real, not prompted. LoRA adapters trained on conversation history produce measurably different response patterns — not "act like a security expert" in the system prompt, but actual weight-level specialization.
- New nodes deploy in minutes, not days. SSH + internet + one command. The playbook is idempotent. Run it twice, nothing breaks.
- The system improves itself. Every classified intent, every tool selection, every neuron fire becomes training data. The fleet scheduler trains new versions daily, deploys as canary, auto-promotes if metrics hold.
The cloud bill is zero. The small models cost nothing to run alongside the 70B — they're rounding errors in the VRAM budget. The infrastructure pays for itself in the first month by eliminating API calls to hosted models.
That's the topology. Desktop for orchestration, DGX for depth, bare-metal nodes for scale. Nine models, three tiers, zero cloud dependency. The big model is a tool. The small models are the brain.