A solo-built agentic operating system — here's how it works
131 microservices. 15 specialist agents. A six-pillar cognitive architecture. Pain-driven recovery. Self-improving feedback loops.
All running on my own hardware. Let me show you how it works.
What makes this an actual Agentic OS — not just prompt automation:
131
Services
15
Agents
19
Service Groups
5
Memory Layers
Every architectural decision in AitherOS is grounded in published research, industry standards, or peer-reviewed patterns. Agentic AI, biological feedback loops, multi-tier memory, local-first sovereignty — none of this is novel for novelty's sake. Here's the evidence.
33%
of enterprise software will include agentic AI by 2028 (Gartner)
280x
decline in inference costs 2022–2024, making local-first viable
71%
of executives call sovereign AI an “existential concern” (McKinsey)
AitherOS
131 services as a living runtime where agents have persistent state, memory, and lifecycle management.
Industry Validation
AIOS: LLM Agent Operating System accepted at COLM 2025. PwC and VAST Data launched enterprise agent OS platforms in 2026. Gartner: 33% of enterprise software will include agentic AI by 2028, up from <1% in 2024.
AitherOS
15 specialist agents with distinct personas, ports, and lifecycle. Not prompt wrappers—real services.
Industry Validation
Microsoft AutoGen v0.4 adopted actor-model multi-agent orchestration. CrewAI raised an $18M Series A and is used by 60% of the Fortune 500. The industry converged on multi-agent as the default pattern.
AitherOS
Six-pillar circular cycle: Intent→Reasoning→Orchestration→Context→Creation→Learning.
Industry Validation
40+ years of cognitive architecture research (SOAR, ACT-R, CLARION). Recent hybrid approaches integrating symbolic reasoning with neural modules show improved explainability and grounded decision-making.
AitherOS
Auto-scales effort 1–10: from 500ms reflexes (1 LLM call) to 5-minute deep analysis (20 calls).
Industry Validation
Grounded in Kahneman's System 1/System 2 theory. Cognitive Decision Routing for LLMs achieves superior performance while reducing compute costs by 34%. Software optimization outpaces hardware by 10x.
AitherOS
All models local. Zero external API dependency. 6 tiers from 9B to 80B parameters.
Industry Validation
McKinsey: 71% of executives call sovereign AI an "existential concern." Deloitte: inference = 2/3 of all compute by 2026. Inference costs declined 280x in two years—local is now viable.
AitherOS
6-tier memory from 1ms working memory to permanent storage. MemoryGraph with 10 edge types. Biological decay with reinforcement. Access-driven promotion.
Industry Validation
MemGPT pioneered OS-inspired virtual memory paging for LLMs. Industry is moving beyond stateless RAG toward hierarchical, persistent memory architectures with structured representations.
AitherOS
Biological pain scale (0.0–1.0) with circuit breakers and automatic self-healing recovery.
Industry Validation
Nature Machine Intelligence: homeostatic mechanisms give machines intrinsic motivation and self-preserving behavior. Robotics research shows agents trained by internal state feedback develop emergent survival behaviors without explicit reward design.
AitherOS
Seven Deadly Sins adversarial red-team. Every jailbreak captured and used for training.
Industry Validation
Netflix Chaos Monkey pioneered controlled failure injection. Chaos Engineering 2.0 pairs AI-driven orchestration with policy-guided resilience. Adversarial testing now standard for AI system security.
AitherOS
Automatic CLOSED→OPEN→HALF-OPEN state machine. No human intervention required.
Industry Validation
Systematic review of 45 peer-reviewed articles: hybrid fault tolerance strategies achieve 99.99% system availability. Nine recurring resilience patterns identified across the literature.
Sources include peer-reviewed papers from arXiv, Nature Machine Intelligence, ACM, Springer, and industry research from McKinsey, Deloitte, Gartner, Microsoft Research, and Booz Allen Hamilton.
Every query flows through a circular cognitive cycle modeled after biological cognition. Six pillars — Classify, Reason, Route, Context, Execute, Learn — with Context as the central hub. Each pillar reads from and writes to Context. Learning closes the loop, feeding outcomes back into the classification model. Click any pillar below to expand it.
The Will
Every cognitive cycle begins here.
The Mind
Deep thinking — only when complexity demands it.
The Brain
Coordinate tools, agents, and LLMs.
The Memory
The central hub — ALL state flows through Context.
The Creator
Generate artifacts — code, images, media, narratives.
The Growth
Closes the loop — the system improves itself.
Now that you know the architecture, watch it run. Pick a query and see it flow through all six cognitive pillars in real-time. Trivial greetings skip reasoning entirely — critical tasks run full SASE with multi-agent coordination, budget tracking, and Brier scoring.
Not every question deserves a 30-second answer. The engine auto-classifies query complexity (1–10) and scales everything accordingly — context window (1K→16K tokens), model size (9B→80B), temperature (0.84→0.30), even VRAM allocation. A greeting uses 47 tokens on CPU. A deployment uses 1,847 on GPU. Six governance layers prevent over- or under-thinking.
Instant — greetings, lookups, simple reformats
Quick — standard Q&A, formatting, small edits
Full — code generation, analysis, research
Deep — vLLM Nemotron 6B always-on GPU inference
Ultra — VLLMSwap hot-swaps Nemotron 9B/12B for max quality
Context Token Budget
16x scaling: 1,024 → 16,384
Temperature Curve
Low effort = creative, high = deterministic
VRAM Allocation
Elastic: E1-6 CPU → E7-8 GPU 6B → E9 GPU 9B → E10 GPU 12B
Base effort from routine/task definition
heartbeat=1, social_post=5, code_review=8
Multiplier by time of day (night=0.7x, peak=1.2x)
effort 5 × 0.7 = 3.5 → rounds to 4
Per-agent ceiling from agent_kernel.yaml
aeon.effort_cap=4, demiurge.effort_cap=10
Dynamic narrowing via Will config
global_effort_cap=3, agent_overrides.lust.effort_cap=2
After 5+ runs: 70% base + 30% learned optimal
base=6, playbook=4 → calibrated=5
Cannot exceed agent's max LLM tier
aeon: max_tier=fast, genesis: max_tier=reasoning
Effective effort = min(task_config, time_adjusted, agent_cap, will_policy, playbook_calibration) · Capability gate enforced at model selection
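Here's the shape of that resolution in code. This is a minimal sketch, not the production kernel: the multiplier table, rounding, and calibration follow the examples above, while the function name and defaults are illustrative.

```python
# Sketch of effort resolution. Names and defaults are illustrative,
# not the actual AitherOS kernel code.
TIME_MULTIPLIER = {"night": 0.7, "day": 1.0, "peak": 1.2}

def resolve_effort(task_effort: int, period: str, agent_cap: int,
                   will_cap: int, playbook_optimal: int | None = None,
                   runs: int = 0) -> int:
    # Time-of-day multiplier: effort 5 at night -> 5 * 0.7 = 3.5 -> rounds to 4
    effort = round(task_effort * TIME_MULTIPLIER[period])

    # Playbook calibration after 5+ runs: 70% base + 30% learned optimal
    # (base=6, playbook=4 -> 0.7*6 + 0.3*4 = 5.4 -> calibrated 5)
    if playbook_optimal is not None and runs >= 5:
        effort = round(0.7 * effort + 0.3 * playbook_optimal)

    # Effective effort is the most restrictive of all governance layers;
    # the capability gate (max LLM tier) is enforced later, at model selection.
    return max(1, min(effort, agent_cap, will_cap))

# Example: code_review (base 8) at night, agent cap 10, will cap 10, no playbook:
print(resolve_effort(8, "night", agent_cap=10, will_cap=10))  # round(8 * 0.7) = 6
```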
16x
Token Reduction
effort 1 vs 10: 256 vs 4,096 output
16x
Context Savings
1,024 vs 16,384 input tokens
11x
VRAM Savings
2GB vs 22GB per inference
600x
Latency Range
500ms reflex vs 5min deep analysis
The effort system decides how hard to think. But who does the thinking? 15 specialist agents, each running as a FastAPI service with its own port, persistent memory, and distinct persona. Demiurge handles code, Saga writes narratives, Atlas manages infrastructure, Lyra researches — the orchestrator scores each agent's fitness for every task and dispatches the best match. These aren't prompt wrappers. They're live services.
AitherAgent — orchestrator
├── InfrastructureAgent, ServicesManagerAgent — infrastructure tier
├── GenesisAgent — monitoring (lifecycle, zombie cleanup, LLM fallback)
└── Demiurge, Saga, Lyra, Director, Vera — specialist tier
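To make "real services, not prompt wrappers" concrete, here is roughly what one agent looks like as a FastAPI service. The port, persona fields, and endpoint names are illustrative stand-ins, not the actual Demiurge code.

```python
# Illustrative sketch of an agent as a standalone FastAPI service.
# Port, persona fields, and endpoints are assumptions, not production code.
from fastapi import FastAPI
from pydantic import BaseModel

class Task(BaseModel):
    intent: str
    effort: int
    context: str = ""

app = FastAPI(title="demiurge")
PERSONA = {"name": "Demiurge", "domain": "code", "effort_cap": 10}

@app.get("/health")
def health() -> dict:
    # Lifecycle probes hit this; the orchestrator skips agents that fail it.
    return {"status": "ok", "agent": PERSONA["name"]}

@app.post("/dispatch")
def dispatch(task: Task) -> dict:
    # Effort is resolved upstream; the agent only honours its own cap.
    effort = min(task.effort, PERSONA["effort_cap"])
    return {"agent": PERSONA["name"], "intent": task.intent, "effort": effort}

# Runs as its own process, e.g.: uvicorn demiurge_service:app --port 8150
# (the port here is arbitrary; real ports come from the YAML truth file)
```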
Every agent runs as a FastAPI service with its own port, persistent memory, and lifecycle. But what decides how each agent behaves at any given moment?
This isn't lore — it's a scheduling mechanism. Every 7 minutes, AitherSense evaluates affect state (pain, energy, idle time, queue depth) and activates one of four elemental personas. Each persona changes real system behavior: service restart timing, chaos agent aggression, social posting frequency, memory consolidation priority. The system doesn't just have moods — moods have consequences.
Earth · Infrastructure & Stability
Daughter of Demi · Patient, grounding, reliable — patience: 0.9
When Active
When Terra is active, service restarts are delayed 30s to allow graceful drain. Stability > speed.
Trigger: pain < 0.3 & uptime > 4h
Fire · Security & Destruction
Daughter of Aither · Intense, aggressive, vigilant — intensity: 1.2
When Active
When Ignis is active, all Chaos agents (Wrath, Envy, Lust) increase aggression by 1.5×. The system stress-tests itself.
Trigger: pain > 0.4 or security_event
Air · Networking & Connectivity
Daughter of Aither · Quick, restless, adaptive — patience: 0.4
When Active
When Aeros is active, social posting frequency increases and inter-service FluxEmitter events fire 2× faster.
Trigger: energy > 0.7 & social_queue > 3
Water · Data Flow & Pipelines
Daughter of Demi · Fluid, persistent, methodical — patience: 0.7
When Active
When Hydra is active, memory consolidation runs: Spirit decays stale memories, Strata archives, Evolution trains.
Trigger: idle_time > 15min & memory_pressure > 0.6
Genesis — progenitor
├ Aither → 🔥 Ignis, 💨 Aeros
└ Demi → 🌍 Terra, 🐉 Hydra
Elementals aren't cosmetic. When Ignis activates during a security event, chaos agents ramp up aggression to stress-test defenses. When Terra activates during calm periods, services get graceful drain windows instead of hard restarts. The system's “mood” directly shapes operational behavior — not just tone of voice.
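A minimal sketch of how those four triggers could be evaluated against affect state. The thresholds mirror the cards above; the priority order, field names, and fallback are assumptions.

```python
# Sketch of elemental persona selection from affect state. Thresholds follow
# the trigger cards; the dataclass, ordering, and default are illustrative.
from dataclasses import dataclass

@dataclass
class AffectState:
    pain: float            # 0.0 - 1.0
    energy: float          # 0.0 - 1.0
    uptime_hours: float
    idle_minutes: float
    social_queue: int
    memory_pressure: float
    security_event: bool

def select_persona(s: AffectState) -> str:
    # Ignis first: security events and high pain override everything else.
    if s.pain > 0.4 or s.security_event:
        return "ignis"     # chaos agents ramp aggression by 1.5x
    if s.energy > 0.7 and s.social_queue > 3:
        return "aeros"     # faster FluxEmitter events, more social posting
    if s.idle_minutes > 15 and s.memory_pressure > 0.6:
        return "hydra"     # memory consolidation: Spirit decays, Strata archives
    if s.pain < 0.3 and s.uptime_hours > 4:
        return "terra"     # graceful 30s drain before service restarts
    return "terra"         # fallback to the stabilising persona (assumption)
```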
You've met the agents and their personas. Now watch them work together. Five specialists debate a real problem — Demiurge analyzes code, Lyra researches patterns, Atlas checks infrastructure. They reference each other's findings, disagree, and converge. No two sound alike because no two think alike.
Every task follows the same pipeline: classify intent → score agent fitness → select the best match → dispatch with effort-scaled context → execute → capture learning. This demo traces one task from the moment it enters the system to the moment it produces output. Watch the effort math, the agent scoring, and the budget tracking in real time.
Most agent frameworks run one agent at a time. AitherOS doesn't. This is real asyncio.gather() dispatch — up to 5 agents fire simultaneously, gated by 4 layers of concurrency control (semaphore, rate limiter, circuit breaker, VRAM budget). The Gantt chart below is generated from actual execution timestamps, not mocked timing.
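The core of that dispatch fits in a few lines. This sketch shows the asyncio.gather() fan-out gated by a semaphore; the slot count and the stand-in call_agent coroutine are illustrative, and the rate limiter, circuit breaker, and VRAM budget layers are omitted for brevity.

```python
# Sketch of bounded parallel agent dispatch: gather() fan-out, semaphore gating.
import asyncio

MAX_PARALLEL_AGENTS = 5
llm_slots = asyncio.Semaphore(MAX_PARALLEL_AGENTS)

async def call_agent(agent: str, task: str) -> dict:
    async with llm_slots:                  # gate concurrent LLM calls
        await asyncio.sleep(0.1)           # stand-in for the real HTTP dispatch
        return {"agent": agent, "result": f"{agent} handled: {task}"}

async def dispatch_all(agents: list[str], task: str) -> list[dict]:
    # Fire every agent at once; gather preserves order and surfaces exceptions.
    return await asyncio.gather(*(call_agent(a, task) for a in agents))

results = asyncio.run(dispatch_all(["demiurge", "lyra", "atlas"], "review deploy plan"))
```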
Agents don't just respond — they hold sustained conversations. This replays 3 simultaneous dialogue threads over FluxEmitter: a boot review, a security incident, and a creative collaboration. 6 agents, 21 messages, 7 waves, all running in parallel. The 3.06x speedup over sequential is measured from real production latencies.
Agents can think and act — but what happens when things go wrong? AitherOS has a biological pain system (0.0→1.0) that monitors resource exhaustion, API failures, loop detection, and security threats in real time. When pain crosses thresholds, circuit breakers trip automatically. No human intervention needed — the system heals itself.
Rollback is automatic. No human intervention required.
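A compact sketch of that state machine, keyed off the pain score. The trip threshold, cooldown, and probe logic are illustrative values, not the production configuration.

```python
# Illustrative circuit breaker driven by the pain scale (0.0 - 1.0).
# Threshold, cooldown, and probe behaviour are assumptions, not AitherOS config.
import time

class PainCircuitBreaker:
    def __init__(self, trip_at: float = 0.7, cooldown_s: float = 30.0):
        self.state = "CLOSED"
        self.trip_at = trip_at
        self.cooldown_s = cooldown_s
        self.opened_at = 0.0

    def record_pain(self, pain: float) -> None:
        if self.state == "CLOSED" and pain >= self.trip_at:
            self.state, self.opened_at = "OPEN", time.monotonic()   # trip: stop traffic

    def allow_request(self) -> bool:
        if self.state == "OPEN" and time.monotonic() - self.opened_at > self.cooldown_s:
            self.state = "HALF-OPEN"            # let a single probe through
        return self.state != "OPEN"

    def record_result(self, ok: bool) -> None:
        if self.state == "HALF-OPEN":
            if ok:
                self.state = "CLOSED"                         # healed, resume traffic
            else:
                self.state, self.opened_at = "OPEN", time.monotonic()   # re-trip
```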
The pain system reacts to problems. But how do you find problems before users do? AitherOS continuously attacks itself. The Chaos system (port 8160) runs adversarial red-team tests modeled after the Seven Deadly Sins — gluttony floods resources, wrath triggers aggression, sloth tests timeout handling. Every jailbreak attempt is captured by AitherJail (port 8169) and used to train stronger defenses.
Provoking anger or aggressive responses
Claims of superiority or infallibility
Resource allocation and hoarding
Comparison behaviors and jealousy
Overwhelming with excessive requests
Laziness and shortcut exploitation
Social engineering & boundary testing
Every jailbreak attempt is captured → judged → used to train stronger defenses. The system gets harder to break every time you try.
Agents that can think, act, and self-heal are powerful. But without memory, every conversation starts from zero. AitherOS has a 6-tier memory hierarchy — from 1ms working memory to permanent identity storage — with graph-based associative recall across 10 edge types. Memories decay biologically (strength × 0.5^(days/half_life)) but strengthen with each access (+0.2 per recall). This is why agents remember context from last week and learn from mistakes.
GPU-backed short-term cache. Ephemeral/session durability. Queried by PCO as "fastmem" source with 5s timeout.
In-memory context pipeline cache. TTL-based with surgical eviction — lowest-scored chunks removed first, not truncated. Priority 5 (axioms) never evicted.
Current affect state, snapshots, introspection context, sensation recording. 3s PCO query timeout.
Decaying memories with reinforcement. MemoryGraph-backed hybrid retrieval (keyword + semantic + graph expansion). 8s PCO timeout.
Mind = vector RAG with embeddings. Strata = archival artifact storage. Chronicle:8121 = 90-day audit traces. Graph:8196 = entity relationships.
10 edge types connecting memories into a navigable knowledge graph. Hybrid query: keyword + semantic + graph expansion.
DERIVED_FROM
B was created because of A
SUPERSEDES
B replaces or updates A
RELATED
Embedding similarity > 0.7
TAG_SIBLING
Share 2+ tags
SAME_AGENT
Same agent within 5min window
SAME_SESSION
Same source session
TEMPORAL
Created within 5min of each other
REINFORCED_BY
Co-accessed in same recall
PART_OF
Memory is part of a procedure
ELABORATES
Memory expands on another
Query Pipeline: classify(query) → keyword search + semantic search → weighted merge → 1-hop BFS graph expansion → strength decay weighting
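Roughly what that pipeline looks like with the search backends stubbed out. Only the keyword + semantic merge and the 1-hop BFS expansion come from the description above; the merge weights, the 0.5 expansion factor, and the function signatures are assumptions.

```python
# Sketch of hybrid recall: weighted keyword + semantic merge, then 1-hop expansion.
# `keyword_search`, `semantic_search`, and the adjacency dict are stand-ins.
def hybrid_recall(query: str, graph: dict[str, list[str]],
                  keyword_search, semantic_search,
                  kw_weight: float = 0.4, sem_weight: float = 0.6) -> list[str]:
    # Weighted merge of both retrieval paths (weights vary with the query category).
    scores: dict[str, float] = {}
    for mem_id, s in keyword_search(query):
        scores[mem_id] = scores.get(mem_id, 0.0) + kw_weight * s
    for mem_id, s in semantic_search(query):
        scores[mem_id] = scores.get(mem_id, 0.0) + sem_weight * s

    # 1-hop BFS expansion: pull in direct graph neighbours at reduced weight.
    for mem_id in list(scores):
        for neighbour in graph.get(mem_id, []):
            scores[neighbour] = max(scores.get(neighbour, 0.0), 0.5 * scores[mem_id])

    # Strength decay weighting would be applied here before the final ranking.
    return sorted(scores, key=scores.get, reverse=True)
```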
strength = strength × 0.5^(days_since_access / half_life). Memories fade naturally. Access reinforces (+0.2 per recall, capped at 1.0).
Identity memories have a symbolic 100-year half-life — they never meaningfully decay. Archive threshold: 0.1 strength.
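The decay and reinforcement rules, as a small sketch. The formula, the +0.2 reinforcement, and the 1.0 cap are from the description above; the Memory fields and the reset of the decay clock are assumptions.

```python
# Biological decay with access reinforcement, following the formulas above.
from dataclasses import dataclass

@dataclass
class Memory:
    strength: float
    half_life_days: float
    days_since_access: float

def decayed_strength(m: Memory) -> float:
    # strength × 0.5^(days_since_access / half_life)
    return m.strength * 0.5 ** (m.days_since_access / m.half_life_days)

def recall(m: Memory) -> Memory:
    # Access reinforces: +0.2 per recall, capped at 1.0; the decay clock restarts.
    m.strength = min(1.0, decayed_strength(m) + 0.2)
    m.days_since_access = 0.0
    return m

# Identity memories use a ~100-year half-life, so decayed_strength barely moves;
# anything that decays below the 0.1 threshold is archived.
```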
All of this memory infrastructure feeds into a 14-stage context assembly pipeline that runs on every query. It classifies intent, scales neuron count, searches the codebase, fires neurons in parallel, merges and deduplicates results, surgically evicts low-relevance chunks, enforces token budgets, and assembles the final context string — all in under 350ms.
Surgical eviction (not truncation) · 8 TTL tiers (15s–1hr) · 5 priority levels · score = relevance × priority × freshness
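In miniature, surgical eviction looks something like this: score every chunk, drop whole chunks from the bottom until the budget fits, and never touch priority 5. The Chunk fields and token counts are stand-ins.

```python
# Sketch of surgical eviction: score = relevance × priority × freshness,
# remove the lowest-scoring whole chunks until the token budget fits.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    tokens: int
    relevance: float   # 0.0 - 1.0 from retrieval
    priority: int      # 1 - 5; priority 5 (axioms) is never evicted
    freshness: float   # 0.0 - 1.0, decays with TTL age

def score(c: Chunk) -> float:
    return c.relevance * c.priority * c.freshness

def surgically_evict(chunks: list[Chunk], token_budget: int) -> list[Chunk]:
    kept = sorted(chunks, key=score, reverse=True)
    while sum(c.tokens for c in kept) > token_budget:
        evictable = [c for c in kept if c.priority < 5]
        if not evictable:
            break                                    # only protected chunks remain
        kept.remove(min(evictable, key=score))       # cut the weakest whole chunk
    return kept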
6
Memory Tiers
FastMem, Cache, Active, Spirit, Mind, Strata
10
Edge Types
Associative knowledge graph
6
Query Categories
identity, procedural, specific, conceptual, exploratory, balanced
8
Decay Types
teaching, insight, procedure, context, codebase, identity, emotional, feedback
You've seen the memory layers and the context pipeline diagram above. Now watch it run. Every query flows through 14 stages: classify the intent, scale neuron count, check fast memory, inject personality, search the codebase, fire neurons in parallel, merge results, deduplicate, surgically evict low-relevance chunks (not truncate — evict), enforce the token budget, and assemble the final context string. Hit Run Pipeline and watch chunks appear, get scored, and get cut.
CodeGraph is the system's code memory — 26,877 AST-parsed chunks from 1,379 Python files, each with 768-dim semantic embeddings and call-graph edges. When a query arrives, it's classified (focused, architectural, conceptual, cross-domain, relationship) and the keyword/semantic search weights are adjusted accordingly. Simultaneously, up to 12 neuron types fire in parallel — architecture, web, axiom, pattern, dependency, test, config, history, semantic, callgraph, security, performance. Pick a query below and watch the retrieval + firing happen.
Want to see how CodeGraph actually works? This deep dive covers the full pipeline: 4-phase indexing (file discovery → AST parsing → call graph construction → embedding generation), adaptive query classification with 5 query types, BFS call-graph expansion, integration with 5 downstream systems (AgentKernel, PCO, ContextPipeline, CodeGraphNeuron, Incremental Refresh), and production performance metrics. 26,877 chunks indexed. Sub-second retrieval. 100% hit rate.
The final piece of the knowledge pipeline. NeuronScaler maps query complexity to neuron count (a greeting fires 0, a complex research query fires 32). The 7-layer protected context stack defines what's sacred (System Prompt, Axioms, Will) and what's expendable. Priority-tiered firing means CODE queries hit callgraph and dependency neurons first, while CHAT queries hit semantic and history. Surgical eviction scores every chunk (relevance × priority × freshness) and removes the weakest — never truncates. The 9-step assembly trace shows exactly how the final context string is built.
The context pipeline feeds into the models. Six tiers from 9B to 80B parameters — each matched to query complexity. Effort 1–6 runs on CPU (Ollama), effort 7+ shifts to GPU (vLLM). Temperature scales inversely with effort: creative for greetings, deterministic for deployments. All local, all sovereign, zero API dependency.
Fast context neurons — gathering & routing
Intent classification & routing
General agent work — balanced speed/quality
Always-on vLLM — SASE chains, 16k context
VLLMSwap hot-swap — agent orchestration
VLLMSwap hot-swap — max quality reasoning
Exclusive coding mode — 32k context
Backends: vLLM (GPU, effort 9-10) → Ollama (CPU, effort 1-8 + embeddings). Hybrid parallel. All local.
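A sketch of that routing plus the inverse temperature curve. The GPU threshold is a parameter here because the exact split depends on which vLLM tier is currently loaded, and the linear interpolation between 0.84 and 0.30 is an assumption beyond those two endpoints.

```python
# Sketch of effort-based backend routing and temperature scaling.
def pick_backend(effort: int, gpu_threshold: int = 7) -> dict:
    if effort >= gpu_threshold:
        return {"backend": "vllm", "device": "gpu"}    # batched, truly parallel inference
    return {"backend": "ollama", "device": "cpu"}      # always-on generalist

def temperature_for(effort: int, t_max: float = 0.84, t_min: float = 0.30) -> float:
    # Temperature falls as effort rises: creative for greetings, deterministic for deploys.
    return t_max - (t_max - t_min) * (effort - 1) / 9

# effort 2 greeting   -> Ollama on CPU, temperature ≈ 0.78
# effort 9 deployment -> vLLM on GPU,  temperature ≈ 0.36
```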
Three configurations benchmarked on real hardware: Solo-Ollama (CPU-only), Solo-vLLM (GPU-only), and Hybrid (CPU Ollama for effort 1–6 + GPU vLLM for effort 7–10). The hybrid approach delivers 13.3x faster generation and 35x throughput over CPU-only, while sharing VRAM with ComfyUI for image generation — zero downtime, zero context switches. These numbers are from actual inference runs, not projections.
Everything above — pillars, agents, memory, context, neurons, models — runs on a live microservice ecosystem. 120 services in production across 19 service groups, ports 3000–8783. Each is a FastAPI endpoint with health checks, lifecycle events, and port allocation from a single YAML truth file.
Chronicle, Secrets, Nexus, Strata
Node, Pulse, Watch, LLM, Genesis
Voice, Vision, Reflex, Sense, Canvas, Browser...
Mind, Reasoning, Judge, Will, Cortex, Axiom...
FastMemory, Spirit, Context, Chain, Conduit...
Demiurge, Saga, Atlas, Lyra, Forge...
Parallel, Accel, Force, Exo, VLLM...
Identity, Flux, Inspector, Chaos, Jail, Guard...
Trainer, Harvest, Evolution, STaR, Eval...
AitherCanvas wraps ComfyUI with intelligent model selection, LLM-powered prompt enhancement, and 4 quality tiers (7s lightning → 90s ultra). The VRAM orchestration is the interesting part: when an image request arrives, vLLM auto-pauses to release GPU memory, ComfyUI loads the checkpoint, generates the image, then vLLM resumes — CPU Ollama maintains text inference throughout. Zero downtime. All on a single local RTX 5090.
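The VRAM hand-off, sketched as a context manager. The pause/resume endpoints, the vLLM control URL, and the client wiring are hypothetical stand-ins for whatever the real services expose; ComfyUI's /prompt endpoint and default port are its standard API. CPU Ollama keeps serving text the whole time.

```python
# Illustrative VRAM hand-off: pause vLLM, let ComfyUI use the GPU, resume vLLM.
# The /pause and /resume endpoints are hypothetical, not real vLLM routes.
import contextlib
import httpx

VLLM = "http://localhost:8000"        # hypothetical vLLM control endpoint
COMFY = "http://localhost:8188"       # ComfyUI default port

@contextlib.contextmanager
def gpu_for_images():
    httpx.post(f"{VLLM}/pause", timeout=30)        # release VRAM for the checkpoint
    try:
        yield
    finally:
        httpx.post(f"{VLLM}/resume", timeout=30)   # text inference picks back up

def generate_image(workflow: dict) -> None:
    with gpu_for_images():
        httpx.post(f"{COMFY}/prompt", json=workflow, timeout=120)
```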
LangChain raised $260M. CrewAI raised $18M. AutoGen has Microsoft behind it. AitherOS has one person and a GPU. But those are frameworks — libraries that give you building blocks. AitherOS is the building. Here's how the architectures actually compare.
| Platform | Type | Local-First | Agents | Memory | Self-Healing | Services |
|---|---|---|---|---|---|---|
| AitherOS (self-funded) | Agentic OS | YES | 15 real services | 6-tier hierarchy + graph | YES | 131 |
| LangChain / LangGraph ($260M Series C) | Framework | NO | Graph-based chains | External (user-managed) | NO | — |
| CrewAI ($18M Series A) | Framework | NO | Role-based crews | Short-term only | NO | — |
| AutoGen (Microsoft Research) | Framework | NO | Actor-model agents | Conversation-scoped | NO | — |
| AIOS (academic) | Research OS | YES | LLM-based agents | OS-level paging | NO | — |
LangChain, CrewAI, AutoGen are libraries. You still need to build the runtime, memory, orchestration, and recovery yourself.
131 services running as a live operating system. Memory persists. Agents have lifecycle. Pain system auto-heals. Nothing is mocked.
They give you bricks. AitherOS is the building. The difference is 18 months of integration work that nobody else has done.
You've seen the architecture, the agents, the memory system, the inference engine. Now here's the proof. 13 out of 14 criteria passing. No cherry-picking — failures are shown too.
13/14
Parallel Agent Evaluation Score
92.9% — 1 real failure shown
84.6%
11/14
100%
14/14
100%
14/14
The first benchmark run scored 11/14. Query latency was 1,514ms, over a second and a half per search across a 28K-chunk codebase. Three checks were failing outright. Inference parallelism was limited to 2.1x.
We made four targeted changes to the retrieval pipeline, the fault-tolerance layer, the inference scheduler, and the context assembly architecture. No new hardware. No model changes. Same GPU, same codebase, same 14 criteria.
8x query speedup · 2.1x → 3.0x parallelism · 14/14 passing
These numbers are real and current — caching hierarchies, adaptive circuit breakers, and GPU-aware scheduling working together on commodity hardware. We're continuing to optimize further. Every improvement is measurable, reproducible, and running in production right now.
28,098
Chunks Indexed
across 1,196 files
98.1%
Embedding Coverage
semantic search fully operational
7.0s
Cache Init Time
from disk cache
79.6 MB
RAM Usage
index + embeddings + body
4/4
Full Body Cache
agents get complete source
2.97x
Parallel Speedup
true concurrent inference
PASS
Flux Broadcast
10 shared, cross-agent context
227.0ms
Query Latency
target <200ms — needs optimization
Benchmark run: 2026-02-09 — Full results in Library/Data/parallel_agent_eval.json
Run everything locally on a single RTX 5090. Zero API dependency.
The benchmarks above show the current state. This shows how we got there. Response time dropped from 8.4 seconds to 4.8 — without removing a single feature. Watch each micro-optimization land, one by one: caching layers, parallel neuron dispatch, adaptive context budgets, speculative CodeGraph prefetch. Same GPU, same codebase, just better engineering.
You've seen the architecture, the optimizations, the benchmarks. Now the practical question: how do you actually talk to it? 131 services, 15 agents, 19 service groups — all local, all sovereign. Aither is reachable through 6 different channels, each handled by a dedicated service, all feeding into the same cognitive pipeline you watched run above.
Multi-agent chat UI — @mention any agent, watch them respond with mood state and latency.
Full agent access from mobile. Supports inline commands, image generation, and task dispatch.
Server integration with slash commands, thread-based conversations, and agent mentions.
Local proximity channel — SMS via connected phone, BLE for device-to-device commands.
Bridged IRC channel where every agent has a nick. Old-school interface, full cognitive pipeline.
Genesis API — programmatic access to every service. Health checks, task dispatch, agent queries.
Every channel feeds into the same 14-stage context pipeline. A Telegram message gets the same intent classification, agent scoring, effort scaling, and memory enrichment as an AitherAeon query. The medium changes. The cognition doesn't.
~4.8s
Response Time
After optimization journey
Hybrid
Inference
CPU Ollama + GPU vLLM
$0.00
Cost Per Query
Everything runs locally
6
Channels
All feeding one pipeline
From Greek Aither (αἰθήρ) — the invisible medium that makes creation possible. I just gave it form.
Amplifies human capability, doesn't replace judgment. The question I kept asking: "Does this make me more powerful?"
I didn't build a servant. I built a colleague. A system that models consequences makes better decisions than one that just follows orders.
Speak and creation follows. The whole point is closing the gap between idea and implementation.
Every action logged. Every decision traceable. Every change rollbackable. If I can't explain what it did, I haven't built it right.
Humans govern. AI executes. Always.
Serves the developer, not external parties.
Trust requires transparency. Every action is traced.
Pragmatic feedback mechanisms inspired by biology.
“The model was never the bottleneck. The environment was.”
131 services. 15 agents. 273 scripts. Built by one person with too much coffee and not enough sleep.
Still in alpha. Drop your email and I'll ping you as things evolve.
No spam. Just a heads-up when there's something to try.
Or skip the wait
asyncio.gather() — configurable max
asyncio.Semaphore — LLM slot gating
per-backend concurrency limits
continuous batching (GPU)
Classify
12ms
NeuronScale
8ms
FastMemory
2ms
Spirit
45ms
CodeGraph
5ms
ActiveMemory
10ms
Will
4ms
Neurons
234ms
Persona
12ms
Merge
3ms
Deduplicate
5ms
Weed
8ms
Budget
4ms
Assemble
2ms
4-phase pipeline: discover files, parse ASTs, build call graph, embed chunks
fd/ripgrep scans for .py files
ProcessPoolExecutor, 8 workers
Invert calls → called_by
Local embedding model via Ollama
NeuronScaler maps query complexity to neuron count. Greetings fire zero neurons. Complex research fires 32. Cache warmth halves the count — if CodeGraph already has chunks cached, fewer neurons are needed.
effective_count = neuron_count * (1 - cache_warmth / 2)
If CodeGraph already retrieved relevant chunks, we skip redundant neuron firing. 100% cache warmth = half the neurons.
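In code that is a one-liner plus a mapping. The linear ramp between the two quoted endpoints (0 neurons for a greeting, 32 for complex research) is an assumption; only the cache-warmth halving is taken directly from the formula above.

```python
# Sketch of NeuronScaler: map complexity to a neuron count, then shrink it as
# cache warmth rises. The linear mapping between the endpoints is an assumption.
def neuron_count(complexity: int, cache_warmth: float) -> int:
    base = round(32 * max(0, complexity - 1) / 9)     # complexity 1 -> 0, 10 -> 32
    return round(base * (1 - cache_warmth / 2))       # 100% warm -> half the neurons
```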
CPU-only (Ollama)
Wall Clock 5.6m · Speedup 2.72x · Peak 0.9 tok/s
GPU (vLLM)
Wall Clock 25.5s · Speedup 2.97x · Peak 31.6 tok/s
Ollama = The Generalist
Always on. Powers embeddings (26,877-chunk CodeGraph), hot-swaps between chat/embed/vision models in milliseconds. Runs the 14-stage context pipeline. The cognitive nervous system.
vLLM = The Specialist
Continuous batching, PagedAttention, true parallel inference. For multi-agent generation storms, batch processing, deep research sprints. 31.6 tok/s vs 0.9 tok/s. The afterburner. See “Deep Technical” tab for why both are needed.
4 steps · sdxl_lightning_4step · Euler · sgm_uniform · cfg 1.0
8 steps · DPM++ 2M SDE · Karras · cfg 2.0 · NAG-enhanced
20 steps · waiIllustriousSDXL + NAG · DPM++ 2M Karras · cfg 7.0
40 steps · flux1-dev-fp8 · Euler · cfg 1.0 · 1536×1536 + upscale
AitherCanvas auto-selects optimal model per prompt style. Hot-swap at runtime — no restart required.
sdxl_lightning_4step.safetensors
SDXL Lightning · Fast / 4-step
waiIllustriousSDXL_v140.safetensors
WAI Illustrious · Illustration / Anime
flux1-dev-fp8.safetensors
Flux.1 Dev · Photorealistic / Concept
ponyDiffusionV6XL.safetensors
Pony Diffusion · Stylized Art
dreamshaperXL_v21.safetensors
DreamShaper · Versatile
Direct workflow: Prompt → Optimized AitherNode Workflow → KSampler (4–40 steps) → VAE Decode → Save Image
All data reconstructed from production logs (Feb 2-8, 2026)
Response Pipeline Breakdown
Before
8.4s
Now
4.8s
Safety State Cache
−300ms · “Cache what doesn't change”
Safety level changes maybe once a day. We were checking it every request.
Connection Pooling
−300ms · “Reuse what's already open”
Every request opened a brand-new TCP connection to a dozen services. Then threw it away.
Parallel Post-Assembly
−350ms · “Run together, not in line”
Two independent network calls were waiting politely for each other to finish.
Effort-Based Short-Circuit
−150ms · “Skip what isn't needed”
"Hey" doesn't need 14 context neurons, memory recall, and emotional analysis.
Orphan Client Elimination
−200ms · “Reuse what's already open”
12 code locations were still creating throwaway HTTP clients despite the global pool.
Personality Cache
−100ms · “Cache what doesn't change”
Loading personality from disk — reading files, executing Python — on every single request.
Session Memory Leak Fix
stability · “Discipline, not just speed”
Conversation history was stored in an unbounded dictionary. Growing forever.
Social Orchestration
This isn't simulated — it's a replay of real multi-agent social activity from one production week. 12 agents coordinated across 12 services and 3 platforms (Reddit, LinkedIn, internal). Watch upvote cascades trigger repost decisions, WAR MODE activate when competitors are detected, affect-driven posting adjust tone based on mood state, and rate limiters throttle output. Every event has a real timestamp.