Every time you send a message to Aither, something remarkable happens before the language model sees a single token. A 10-stage pipeline fires across a dozen parallel data sources — memories, code graphs, system telemetry, emotional state, real-time awareness — and surgically assembles a context window that makes the difference between a generic chatbot response and an answer that actually knows what it's talking about.

This is the story of how that pipeline works, why we built it the way we did, and what happens in those critical milliseconds between your message and Aither's first token.

The Problem: LLMs Don't Know Anything About Right Now

Large language models are frozen in time. They know what was in their training data, and nothing else. Ask a vanilla LLM "how many tools do you have?" and it will confidently hallucinate a number. Ask it "what time is it?" and it will either refuse or guess. Ask it about a conversation you had yesterday and it draws a blank.

The standard solution is RAG — Retrieval-Augmented Generation. Stuff some documents into the prompt and hope for the best. But RAG is a blunt instrument. It doesn't know what kind of context you need, how much to retrieve, or what to do when you've retrieved more than fits in the context window.

AitherOS takes a fundamentally different approach. Instead of treating context as a retrieval problem, we treat it as an assembly problem — a multi-stage pipeline where each stage makes intelligent decisions about what the model needs to know, how much of it to gather, and how to fit it all within a token budget without losing anything critical.

The Architecture at a Glance

User message
  → IntentClassifier (effort 1-10, category, needs_tools)
  → EffortScaler → ExecutionPlan (model, context_depth, reasoning_depth)
  → ContextPipeline.assemble() → 10 stages
     ├─ Stage 0:   Connect daemon (warm cache between requests)
     ├─ Stage 1:   Classify (which context layers does this need?)
     ├─ Stage 2:   Scale (how many neurons? what token budget?)
     ├─ Stage 3:   Cache hit + spillover recall
     ├─ Stage 3.5: Flux (real-time system telemetry, <1ms)
     ├─ Stage 4-5.9: Parallel gather (6 sources simultaneously)
     ├─ Stage 6:   Ingest (bridge all formats into unified chunks)
     ├─ Stage 7:   Weed (remove sensitive/injected content)
     ├─ Stage 8:   Budget (surgical eviction, never truncation)
     └─ Stage 9:   Score (quality assessment)
  → build_system_message() → 15 ordered context layers
  → LLMGateway → vLLM/Ollama → Response

Total time: 300ms–3000ms depending on effort level. Let's walk through each piece.

Stage 0: The Daemon Connection

Before the pipeline even starts, there's infrastructure running in the background. The NeuronDaemon maintains a warm cache of frequently-needed context — system status, recent events, service health. Between requests, it's quietly refreshing this data so that when a request does arrive, the most common context sources are already hot.

Stage 0 registers a callback with this daemon: "when you get new data, feed it into my cache." This means the pipeline doesn't start cold. It starts with whatever the daemon has already gathered.

Stage 1: Classify — What Does This Message Actually Need?

Not every message needs the same context. "Hello" doesn't need code graphs and service topology. "How does the AgentForge dispatch pipeline work?" absolutely does.

The SmartContextManager analyzes the message and determines which context layers are relevant:

IDENTITY — always present (who am I?)
WILL — behavioral persona (how should I act?)
SPIRIT — long-term memory (what do I remember?)
PROJECT — real-time project state
WORKING — session-scoped working memory
LIVENEWS — web/external knowledge
NEURONS — semantic search results
HISTORY — conversation history
CURRENT — immediate sensory input

A greeting might only need IDENTITY and WILL. A technical question needs NEURONS, SPIRIT, PROJECT, and possibly HISTORY. This classification prevents us from wasting tokens (and latency) gathering context that won't be used.

Stage 2: Scale — The Neuron Budget

This is where things get interesting. The NeuronScaler determines how many parallel semantic searches ("neurons") to fire, based on three factors:

Query complexity classification:

Complexity	Neurons	Example
Greeting	0	"hey", "thanks"
Simple fact	2	"what port is Genesis on?"
Code question	8	"how does FluxEmitter work?"
Multi-domain	16	"compare the memory and context pipelines"
Research	24	"audit all security boundaries in the system"
Complex research	32	"design a new agent orchestration pattern"

Cache warmth discount: If the cache already has >60% of the tokens we'd need, we halve the neuron count. No point re-searching what we already have.

Execution profile modulation: If the EffortScaler says this is a TDD workflow, we guarantee at least 5 neurons. Coding tasks get at least 4. These floors prevent under-contextualization for specialized workflows.

The scaler also derives the token budget using what we call the "Essay Principle" — we take the minimum of the model's context capacity and the query's actual needs. A simple question shouldn't fill a 128K context window just because the model supports it.

Stage 3: Cache Hit and Spillover Recall

Before firing any new searches, we check what's already in the ActiveContextCache. This is a chunk-addressable store where every piece of context is a discrete object with:

content — the actual text
source — where it came from (code, memory, web, flux, etc.)
relevance — 0.0 to 1.0 semantic match score
priority — 1 (misc) to 5 (axioms)
TTL — time-to-live before stale (flux: 15s, memory: 300s, axioms: 3600s)
content_hash — SHA256 for deduplication

Stale chunks get expired. Then comes the clever part: spillover recall.

When chunks were evicted from the cache in previous requests (because we exceeded the token budget), they weren't deleted. They were demoted to a spillover tier — a secondary store indexed by keywords. Now, if the current query has keyword overlap with that evicted content (measured by Jaccard similarity), we pull it back in at a lower priority.

This means context is never truly "lost." It cascades through tiers:

Active Cache (Tier 0) → KernelContextBus (Tier 1) → MemoryBus (Tier 2)

Each tier is searchable. Content that was relevant two conversations ago can resurface if it becomes relevant again.

Stage 3.5: Flux — The Nervous System

FluxEmitter is AitherOS's real-time telemetry bus. Every service in the system pushes state updates into it — health checks, tool registrations, agent activities, GPU load, git branch, emotional state. The FluxContextState singleton holds all of this as an in-process data structure.

Injecting Flux context costs less than 1 millisecond. No HTTP calls, no database queries. It's just reading a Python object that's continuously updated by event handlers. The result is a compact snapshot:

Time: 14:30:05
Platform: Windows 11 Pro, RTX 5090 (31GB VRAM)
Services: 78/85 healthy
GPU: 4.2GB/31GB used (aither-orchestrator loaded)
Git: develop (3 modified files)
A2A Network: 27 agents connected
MCP Tools: 346 tools across 76 categories

This goes into the [SYSTEM STATE] layer of the system prompt. The model always knows the current state of the world — no tool calls needed for basic system awareness.

Stages 4–5.9: The Parallel Gather

This is where most of the latency lives, and where parallelism matters most. Six independent context sources fire simultaneously via asyncio.gather:

Stage 4: AitherContextAssembler

Fetches three things in parallel:

Will — the agent's behavioral persona from identity YAML files
Spirit — long-term episodic memory with decay-based relevance
Affect — emotional state (valence, arousal, mood)

Stage 5: ParallelContextOrchestrator (Neurons)

The scaled neuron count fires here. Each "neuron" is a targeted semantic search across the codebase, documentation, and knowledge base. On an Ultra-tier system (RTX 5090, 32 threads), this means up to 32 parallel searches with progressive aggregation.

Progressive aggregation is key: we don't wait for all 32 neurons to finish. Once we hit a confidence threshold (0.85 on Ultra, 0.70 on Minimal), we return early with what we have. The slowest neuron doesn't block the pipeline.

Stage 5.5: Knowledge Graph Bridge

Queries the unified knowledge graph for concept nodes, code relationships, and cross-domain links. This is where "how does X relate to Y?" questions get their structural context.

Stage 5.7: Memory Bus

A unified query across all memory tiers — Spirit (episodic), Working (session), and Graph (semantic). The cascade ensures that if Spirit doesn't have a relevant memory, Working and Graph are still checked.

Stage 5.7.5: Conversation Recall

Searches past conversations across sessions. If you discussed something three days ago and reference it now, this stage finds it.

Stage 5.9: Recursive Refinement

The most sophisticated stage. This implements RLM-style (Recursive Language Model) gap analysis: after the initial gather, it examines what's missing. If the neuron results don't cover a key aspect of the query, it fires targeted follow-up searches to fill the gap. Then it compresses redundant results via map-reduce.

Stage 6: Ingest — The Great Unification

Every gather stage returns data in a different format. PCO returns ContextChunk objects. The ContextEngine returns ContextItem objects. The Synthesizer returns yet another format. Stage 6 bridges all of them into ActiveContextChunk — our unified format with consistent scoring, TTL, and priority.

Every chunk gets a SHA256 content hash. Duplicates are caught here — if two different sources returned the same information, we keep only the higher-priority version.

Stage 7: Weed — Security Filtering

Before anything goes into the prompt, it passes through the security filter. This stage catches:

Prompt injection patterns — attempts to override system instructions via retrieved content
Sensitive data leakage — internal credentials, API keys, PII that made it into a knowledge base
Homoglyph attacks — Unicode characters that look like ASCII but could bypass filters (we normalize to NFKC)

Every piece of retrieved context is untrusted by default. The model's system prompt is sacred territory, and we sanitize everything that enters it.

Stage 8: Budget — Surgical Eviction

This is the stage that makes or breaks context quality. We have a hard token budget (derived from the model's context window minus a 1,500-token response reserve), and we probably gathered more context than fits.

The naive approach is truncation: cut everything after N tokens. We never do this.

Instead, we use surgical chunk-level eviction. Every chunk has a composite score:

score = (relevance × priority_weight × freshness_decay) + access_boost

relevance — semantic match to the query (0.0–1.0)
priority_weight — source importance (axioms=5 can never be evicted)
freshness_decay — how old the chunk is relative to its TTL
access_boost — recently-accessed chunks get a bump (LRU-like)

The lowest-scoring chunks get evicted first. If a chunk has a group_id, its entire group goes together — you don't want half a code explanation without the other half.

Evicted chunks aren't deleted. They cascade to the spillover tier for potential recall in future requests.

Stage 9: Score — Quality Assessment

The final stage computes a quality score for the assembled context:

Token coverage by source type
Layer completeness (did we get all the layers Stage 1 said we needed?)
Freshness distribution (how stale is the average chunk?)
Eviction ratio (what percentage of gathered context survived the budget?)

This score feeds into Context X-Ray, our observability system, where you can inspect any pipeline run forensically — seeing exactly which chunks survived, which were evicted, and why.

The System Prompt: 15 Layers, Carefully Ordered

After the pipeline completes, build_system_message() takes all the gathered context and assembles the final system prompt. The layer ordering is deliberate and serves two purposes: information priority and prompt caching.

Layer 0:    [AXIOMS]           — Immutable truths, identity protection
Layer 1:    [IDENTITY]         — "You are Aither, created by Alex"
Layer 1.5:  [SOUL]             — Personality overlay (voice, style)
Layer 1.8:  [GIT WORKFLOW]     — Branch model, CI/CD rules
Layer 2:    [SYSTEM KNOWLEDGE] — AitherOS architecture summary
Layer 2.5:  [PARTNER CONTEXT]  — Partner-specific knowledge
Layer 3:    [CAPABILITIES]     — What tools and services are available
────────────── prompt caching boundary ──────────────
Layer 4:    [SYSTEM STATE]     — Live Flux telemetry
Layer 4.5:  [SYSTEM FACTS]     — Authoritative inventory numbers
Layer 5:    [CURRENT TIME]     — Exact time (authoritative, non-overridable)
Layer 6:    [CONTEXT]          — Neuron search results (1,600 token budget)
Layer 7:    [MEMORIES]         — Spirit/WorkingMemory recalls (1,100 tokens)
Layer 7.5:  [SESSION LEARNING] — Short-term behavioral adaptations
Layer 8:    [AFFECT]           — Emotional state (250 tokens)
Layer 9:    [AWARENESS]        — JarvisBrain real-time briefing
Layer 10:   [RESPONSE FORMAT]  — Output shape for machine callers

The Prompt Caching Boundary

Layers 0–3 are stable across requests. Your identity, capabilities, system knowledge, and axioms don't change between messages. This means vLLM's Automatic Prefix Caching (APC) can cache this entire prefix — roughly 3,000–5,000 tokens — and reuse it for every request. Only Layers 4+ change per turn.

This is why [CURRENT TIME] is at Layer 5, not Layer 0. Time changes every second. If we put it before the stable prefix, we'd invalidate the entire cache on every request. By placing it after the boundary, we get prompt caching for free on the expensive identity/knowledge layers.

The Trained Model Optimization

When the model is aither-orchestrator (our fine-tuned Qwen 2.5), it has internalized the system knowledge, personality, git workflow, and response format through SFT/DPO training. The model_is_trained flag tells build_system_message() to skip these layers entirely — saving approximately 2,800 tokens per call. The trained model gets a minimal identity line ("You are Aither.") and jumps straight to dynamic context.

AutoNeuronFire: The Subconscious

Running alongside the main pipeline is AutoNeuronFire — Aither's automatic data injection layer. While the pipeline is an explicit, staged process, AutoNeuronFire is more like a reflex.

It maintains 36 pattern detectors across categories like time, weather, system status, code context, git state, resource usage, GPU status, active agents, alerts, and — as of this week — tool inventory, agent inventory, and service inventory.

When you ask "how many tools do you have?", AutoNeuronFire's TOOL_INVENTORY detector fires (priority 8, TTL 120s). Its fetch function scans the MCP tool directory on disk, counts tool definitions by parsing Python source files, and checks Flux state for live tool counts. The result goes into the neuron context as an explicit fact:

"AitherOS has 346 MCP tools across 76 categories. Largest: fs(42), git(15), deploy(12)."

The model doesn't need to guess. It doesn't need to call a tool. The answer is already in its context before it generates a single token.

Effort-Driven Context Depth

Not every question deserves the full pipeline. The EffortScaler maps intent-classified effort levels (1–10) to context depth:

Effort	Context Depth	What Gets Gathered
1–2	axioms	Time, identity. That's it.
3–4	fast	+ working memory, basic neurons (~500–1,000 tokens)
5–6	full	+ spirit memory, persona, full neuron battery (~2,000–3,000 tokens)
7–8	full_graph	+ knowledge graph, code graph, recursive refinement (~3,000–5,000 tokens)
9–10	all_layers	ALL memory tiers in parallel + graph + compress (~8,000–16,000 tokens)

A "hi" gets effort 1 and axioms-only context. "Audit the security boundaries across all 203 services" gets effort 9 and the full battery. The system scales its cognitive investment to match the complexity of the question.

Context X-Ray: Full Observability

Every pipeline run produces a PipelineSnapshot captured in a ring buffer (50–500 snapshots). Each snapshot contains:

Stage timings — how long each of the 10 stages took
Surviving chunks — every piece of context that made it into the final prompt, with source, relevance score, priority, freshness, and token count
Evicted chunks — everything that was gathered but didn't make the cut, with the score that caused its eviction
Layer decomposition — the final system prompt parsed into tagged sections with per-layer token counts
Quality score — overall context quality (0.0–1.0)

This is how we debug context issues. When the model gives a wrong answer, we can pull up the X-Ray snapshot and see exactly what it knew (and didn't know) when it generated that response. Was the relevant memory there but evicted by the budget? Was the neuron search too narrow? Did a security filter strip something it shouldn't have?

The Numbers

Component	Latency	Token Budget
Flux injection	<1ms	~100 tokens
CodeGraph search	~2ms	200–500 tokens
AutoNeuronFire detection	~5ms	0 (pattern matching only)
Cache hit check	~10ms	varies
Spillover recall	~50ms	1,000–2,000 tokens
Full pipeline (E5)	~800ms	4,096 tokens
Full pipeline (E9)	~2,500ms	16,384 tokens
Prompt cache hit (Layers 0–3)	<10ms	3,000–5,000 tokens saved

The entire pipeline, from message receipt to first LLM token, stays under 3 seconds even at maximum effort. For simple questions (effort 1–3), it's under 200ms.

What Makes This Different

Most AI systems treat context as an afterthought — a RAG retrieval tacked onto the side of an LLM call. AitherOS treats context as the primary engineering challenge. The LLM is almost secondary; it's the context assembly that determines whether the response is informed or hallucinated.

Three design decisions make this work:

1. Never truncate, always evict. String truncation destroys semantic boundaries. A sentence cut in half is worse than no sentence at all. Our chunk-level eviction removes the least valuable complete pieces of context, preserving the integrity of everything that survives.

2. Spillover, not deletion. Evicted context cascades through tiers. Nothing gathered is ever truly lost — it's just deprioritized. If it becomes relevant again in a future turn, spillover recall brings it back. This gives the system a form of "peripheral awareness" — things it's not actively thinking about but can access if prompted.

3. Effort-proportional investment. Simple questions get simple context. Complex questions get the full battery. This isn't just about saving tokens — it's about latency. A greeting should respond in milliseconds, not wait for 32 parallel semantic searches to complete.

The result is an AI that knows what time it is, knows how many tools it has, remembers what you discussed yesterday, understands its own architecture, and can cite exact numbers from live system telemetry — all without a single tool call. The context was already there before the model started thinking.

That's what context assembly gives you. Not retrieval. Assembly.

Enjoyed this post?

All posts Try AitherOS

Back to blog

architecturecontextpipelinellmdeep-dive

How AitherOS Assembles Context: A 10-Stage Pipeline That Thinks Before It Speaks

March 26, 202614 min readAlex