Environment Over Instruction: Why We Stopped Prompt Engineering and Built an Operating System Instead
The Prompt That Should Have Failed
Here is a real prompt from a recent session. We did not clean it up:
ok so i want a blog post about our philosophy re: prompting — basically
the idea that instead of prompt engineering we build a rich environment
(graphs, neurons, context pipeline, memory, tools) so any messy prompt
works. like jennifer aniston neurons from neuroscience, matryoshka
representation learning, turboquant eviction — reference all the real
stuff. make it ~11k words, match existing blog style. oh and cite the
GLAM paper and quiroga 2005 and the kusupati MRL paper. contrarian
angle — everyone else optimizes prompts, we optimize the environment
the prompt lands in. aitheros is the proof of concept
Run-on sentences. No capitalization. Vague references to "the real stuff." A parenthetical brain-dump of subsystem names. The kind of prompt that every prompt engineering guide tells you will produce garbage.
Here is what the agent did: it launched three parallel research threads — one to enumerate every graph class in the codebase, one to count neuron types and firing modes, one to read existing blog posts for tone calibration. It correctly disambiguated "jennifer aniston neurons" as Quiroga et al. (2005), "matryoshka representation learning" as Kusupati et al. (2022, NeurIPS), and "turboquant eviction" as the KVCacheGraph eviction advisor. It cross-referenced real line counts, real class names, real stage definitions. It produced structured research across 22 knowledge graphs, 41 neuron types, and a 22-stage context pipeline — all verified against source files.
This is not a success story about prompt engineering. It is a success story about the environment the prompt landed in.
The agent did not need a carefully structured prompt with XML tags, role definitions, chain-of-thought scaffolding, and output format constraints. It needed a codebase indexed into 22 knowledge graphs. It needed 41 neuron types that fire autonomously to gather context before the LLM sees the request. It needed a 22-stage pipeline that assembles 50,000 tokens of grounded context around whatever the user typed. It needed 856 MCP tools with a sub-millisecond predictor that selects the right 8.
The prompt was the spark. The environment was the fuel.
We have spent the past five months building AitherOS — an AI operating system with 211 microservices — and somewhere along the way, we stopped caring about prompts. Not because prompts do not matter, but because we discovered something that matters more: the operational environment the prompt lands in. Build a rich enough environment, and the prompt becomes a formality. A nudge. A direction. The system figures out the rest.
This post is the argument for why, the evidence for how, and the prediction for where the industry is headed.
The Orthodoxy: Prompt Engineering as Religion
Open any prompt engineering guide from the major labs. The advice is remarkably consistent:
OpenAI: "Write clear, specific instructions." Anthropic: "Give Claude context and be specific." Google: "Structured prompts produce better results."
None of this is wrong. It is incomplete — and the incompleteness hides an assumption that shapes the entire industry's approach to AI agents.
The assumption: the LLM is a fixed function, and your only lever is the input.
If the model is a black box, then the rational strategy is to optimize what goes into the box. Hence the elaborate taxonomies of prompting techniques — few-shot, chain-of-thought, tree-of-thought, self-consistency, ReAct, reflection. Each technique is a way of structuring the input to coax better outputs from a function you cannot modify.
This framing has produced an entire industry. Prompt engineering courses. Prompt marketplaces. Prompt optimization frameworks. A job title — "prompt engineer" — that did not exist three years ago and now commands six-figure salaries. The implicit promise: master the input, master the output.
The "Plan then Execute" paradigm extends this logic to agents. AutoGPT, BabyAGI, CrewAI, LangGraph — elaborate planning scaffolding that decomposes vague requests into detailed sub-tasks before execution begins. The theory: if you can turn a messy intent into a precise plan, then each step executes cleanly.
The failure mode is instructive. When the plan is wrong — and plans generated from a single prompt often are — execution faithfully implements the wrong plan. The agent charges through seven carefully sequenced steps, each one correct in isolation, arriving at a destination nobody wanted. Detailed prompts become detailed mistakes. The precision of the instruction amplifies the error, because the system has no independent way to realize the plan is wrong.
Consider a concrete example. You ask an agent: "refactor the payment processing module." A plan-first system generates: Step 1: Read payment_processor.py. Step 2: Identify code smells. Step 3: Extract helper functions. Step 4: Add type hints. Step 5: Write tests. Step 6: Update documentation. Step 7: Create pull request.
The plan looks reasonable. But it does not know that payment_processor.py was rewritten last week and the actual problem is in the billing_gateway.py that calls it. It does not know that the team convention is to put helpers in a shared utils module, not inline. It does not know that the test suite uses a specific factory pattern for payment fixtures. It does not know that "documentation" means updating the API spec in the OpenAPI file, not writing a README.
A senior engineer on the team would know all of this. Not because they received better instructions, but because they operate in a richer environment.
The plan-first paradigm has a deeper problem: it front-loads decisions to the point of lowest information. The moment you have the least understanding of the task — before any code has been read, any context gathered, any history recalled — is the moment you are asked to produce the most detailed specification of how to solve it. This is architecturally backwards. It is like writing a detailed travel itinerary before looking at a map.
Here is a side-by-side comparison. First, the heavily engineered prompt:
<role>You are a senior software architect analyzing the AitherOS
codebase. You have deep expertise in distributed systems, knowledge
graphs, and neuroscience-inspired computing.</role>
<context>The user wants a technical blog post about environment-first
AI agent design. The post should reference the following systems:
- 22 knowledge graphs (BaseFacultyGraph subclasses)
- 41 neuron types (NeuronType enum in AitherNeurons.py)
- 22-stage context pipeline (ContextPipeline.py)
- 856 MCP tools with 3-tier selection (ToolGraph.py)
- 4-tier memory architecture</context>
<instructions>
1. Research each system by reading the source files
2. Verify all numbers against actual code
3. Cross-reference existing blog posts for style
4. Structure the output as a long-form technical essay
5. Cite: Quiroga 2005 (Nature), Kusupati 2022 (NeurIPS), Carta 2023 (ICML)
</instructions>
<output_format>Markdown with YAML frontmatter, ~11000 words,
first-person plural voice, em dashes for parentheticals</output_format>
<chain_of_thought>Think step by step about the best structure for
this argument before writing.</chain_of_thought>
Now the messy version from the top of this post. In AitherOS, both produce equivalent output — because the 300 tokens of engineered prompt add almost nothing that the environment does not already provide. The system already knows the codebase structure (CodeGraph), already knows the blog style (DocGraph + prior post corpus), already knows the neuroscience references (WikipediaGraph + web search neurons), already knows to verify numbers against source files (that is what neurons do).
The engineered prompt is a user doing the system's job. Poorly.
The XML tags tell the system things it already knows — if it has an environment. The chain-of-thought instruction tells the model to reason step-by-step about things the environment could have pre-computed. The output format specification constrains a format the environment could have inferred from past blog posts. Every line of the engineered prompt is a manual substitution for environmental knowledge.
This is not a criticism of the people who write these prompts. It is a criticism of the systems that require them. When your only interface to a model is a text box, of course you pack that text box with as much structured guidance as you can. The problem is not the prompting — it is the poverty of the environment.
The Senior Engineer Analogy
You do not write a 500-word specification to a senior engineer who has worked on the codebase for three years. You say "fix the auth bug" and they figure it out.
They know which files to check. They know the test suite. They know the deployment pipeline. They know that the last three auth bugs were all in the middleware layer. They know the team's conventions, the recent refactors, the technical debt. All of this context exists in their environment — their accumulated knowledge of the codebase, the team, the infrastructure.
Now imagine you took that same senior engineer, wiped their memory, and handed them a detailed specification: "The authentication middleware at line 347 of auth_middleware.py has a race condition in the session validation path. The fix requires acquiring a lock before reading the session store, verifying the token expiry against the current timestamp, and releasing the lock after write-back. Use the existing LockManager from lib/concurrency/locks.py. Write tests using the AuthTestHarness from tests/fixtures/auth.py."
That specification might produce a correct fix. It might also be wrong — maybe the race condition is actually in the session store, not the middleware. The specification embeds assumptions that the senior engineer would have caught and corrected, because they have context the specification author lacks.
This is the inversion we are proposing: instead of optimizing the 200 tokens the user types, optimize the 50,000 tokens of context the system assembles automatically.
The prompt is an address on an envelope. The postal system does the delivery.
Think about what happens when a new engineer joins a team. On day one, they need detailed instructions for everything — where the repo is, how to run tests, what the architecture looks like, who owns which components. By month six, they need almost none of this. Their environment has become rich enough to support autonomous action. They have built mental models of the codebase, the team's conventions, the deployment pipeline, the history of past decisions. The instructions did not get better. The environment got richer.
This is not a metaphor. It is the literal mechanism we implemented. As AitherOS accumulates knowledge graphs, training data, memory, and tool usage patterns, the system requires progressively less prompting precision. A new AitherOS installation with empty graphs and untrained neurons does benefit from more detailed prompts. A mature installation with months of accumulated context does not. The environment matures, and the prompting bar drops.
Anthropic's own research on agent autonomy versus agency supports this framing. Autonomy — the ability to operate independently in novel situations — requires environmental awareness, not just better instructions. An agent that can only follow detailed instructions has agency (it can act) but not autonomy (it cannot adapt). Autonomy emerges from the relationship between the agent and a rich, responsive environment.
Google Cloud's grounding research reaches a similar conclusion from a different direction: LLMs perform dramatically better when grounded in structured data sources. The quality improvement from grounding dwarfs the quality improvement from prompt optimization. A mediocre prompt with excellent grounding beats an excellent prompt with no grounding — every time.
The industry knows this. RAG exists. Tool use exists. Fine-tuning exists. But the dominant mental model still treats these as accessories to prompt engineering, rather than the other way around. We are arguing for the inversion: prompt engineering is the accessory. The environment is the main event.
Jennifer Aniston Neurons and Concept Cells
In 2005, Rodrigo Quian Quiroga and colleagues published a paper in Nature that reshaped our understanding of how biological brains represent concepts. They were recording from individual neurons in the medial temporal lobe of epilepsy patients (who had electrodes implanted for clinical reasons) when they discovered something remarkable: single neurons that responded selectively to specific concepts.
One neuron fired when the patient saw a photograph of Jennifer Aniston. The same neuron fired when the patient saw a different photograph of Jennifer Aniston. It fired when the patient saw a drawing of Jennifer Aniston. It fired when the patient read the text "Jennifer Aniston." It did not fire for other celebrities, other faces, other names.
These were not grandmother cells — the long-hypothesized (and largely discredited) idea that single neurons encode single concepts in a one-to-one mapping. Quiroga's concept cells are distributed but selective. Multiple neurons participate in representing "Jennifer Aniston," but each individual neuron shows remarkable selectivity. The representation is sparse: out of billions of neurons, only a small population activates for any given concept.
What makes concept cells remarkable is their invariance. The Jennifer Aniston neuron does not fire for "a woman with blonde hair" or "a celebrity" or "an actor from a TV show." It fires specifically for Jennifer Aniston — regardless of viewing angle, lighting, artistic style, or modality (visual vs. textual). The neuron has learned the abstract concept, not the surface features. And it learned this without ever being told what "Jennifer Aniston" is.
The critical insight for our purposes: concept cells are not programmed. They emerge from rich perceptual environments.
Nobody told that neuron to learn Jennifer Aniston. No supervisor labeled the training data with "this is Jennifer Aniston, fire for this stimulus." The neuron developed its selectivity through exposure to a rich, multimodal environment — faces, names, voices, contexts — where Jennifer Aniston appeared frequently enough to warrant a dedicated representation. The environment shaped the neuron, not an instruction.
This has a profound implication for AI systems. The prompt engineering approach is analogous to trying to directly program concept cells — specifying exactly what each neuron should respond to, with explicit rules and conditions. The environmental approach is analogous to how concept cells actually develop — building a rich perceptual environment and letting selective representations emerge from exposure.
The neuroscience community has debated whether concept cells are learned or innate. The consensus, supported by developmental studies, is that they are learned — formed through experience with the environment, shaped by the statistics of what the organism encounters. Patients who have never encountered a particular celebrity do not have concept cells for them. Patients who encounter a new person develop new concept cells within sessions. The environment is the teacher.
This maps directly to AitherOS. The system has 41 neuron types that fire autonomously in response to stimuli. When a user message arrives, the IntentClassifier analyzes the message and the NeuronScaler determines which neurons fire and at what intensity. The user never says "activate the CodeGraph neuron" — the environment decides.
User: "fix the auth bug"
IntentClassifier → intent: CODE_TASK, effort: 4
NeuronScaler → fire: [CODE, CODEGRAPH, TESTGRAPH, MEMORY, CONVERSATION]
skip: [WEB, CANVAS, DEPLOY, WIKIPEDIA, MEDIAGRAPH]
→ CodeGraph neuron indexes relevant auth files
→ TestGraph neuron finds related test fixtures
→ Memory neuron recalls past auth conversations
→ Conversation neuron loads recent context
→ All results assembled into context BEFORE the LLM sees the prompt
The neuron firing is sparse, and the sparsity is effort-dependent. At effort level 1–2 (simple greetings, quick lookups), only 3–5 neurons fire — enough to check health and load basic context. At effort level 5–6 (moderate tasks), 8–12 neurons fire — adding graph context, memory recall, and analytical processing. At effort level 7–10 (complex research, multi-step reasoning), 15–20+ neurons fire — the full perceptual apparatus engaged, including recursive refinement and cross-domain bridging.
This is not merely a design choice inspired by neuroscience — it is the same principle Quiroga observed. Sparse, selective activation in response to environmental stimuli produces better representations than dense, uniform activation. The system does not need every neuron for every query, just as the brain does not activate every concept cell for every perception.
The effort-based scaling maps directly to the EffortScaler's pillar activation table:
Effort 1-2: Intent + Context + Creation → 3-5 neurons
Effort 3-4: + Reasoning (gated) + Learning → 6-8 neurons
Effort 5-6: + Orchestration + Quality gates → 8-12 neurons
Effort 7-8: + SASE reasoning + Graph context → 12-16 neurons
Effort 9-10: + Verification + All layers + Calibrate → 16-20+ neurons
The budget increases are not linear. They follow a principle that mirrors biological neural scaling: each additional effort level adds specialized processing, not just more of the same. Effort 7 does not fire 2.3x more neurons than effort 3 — it fires qualitatively different neurons. SASE reasoning activates at effort 7. Multi-agent orchestration activates at effort 8. Full verification with calibration activates at effort 9. The system recruits new cognitive capabilities, not just more compute.
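As a rough sketch of how an effort score could translate into a neuron budget, consider the following. The tier boundaries mirror the pillar table above, but the function and data structure are illustrative, not the actual EffortScaler implementation:

```python
# Minimal sketch of effort-based neuron budgeting (illustrative, not the
# actual EffortScaler). Capabilities accumulate with effort, mirroring
# the pillar activation table above.
TIERS = [
    (2,  (3, 5),   ["intent", "context", "creation"]),
    (4,  (6, 8),   ["reasoning_gated", "learning"]),
    (6,  (8, 12),  ["orchestration", "quality_gates"]),
    (8,  (12, 16), ["sase_reasoning", "graph_context"]),
    (10, (16, 20), ["verification", "all_layers", "calibrate"]),
]

def neuron_budget(effort: int) -> tuple[tuple[int, int], list[str]]:
    """Return (min, max) neuron count and accumulated capabilities for an effort score."""
    capabilities: list[str] = []
    for ceiling, budget, added in TIERS:
        capabilities.extend(added)        # each tier recruits new capabilities
        if effort <= ceiling:
            return budget, capabilities
    return TIERS[-1][1], capabilities      # clamp anything above 10

print(neuron_budget(3))   # ((6, 8), [..., 'reasoning_gated', 'learning'])
print(neuron_budget(8))   # ((12, 16), [..., 'sase_reasoning', 'graph_context'])
```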
The four firing modes reinforce this parallel:
| Mode | Biological Analogy | Example |
|---|---|---|
| TIMER | Autonomic rhythms (heartbeat, breathing) | Health neuron: fires every 30 seconds |
| EVENT | Reflexive response (touching something hot) | Conversation neuron: fires on new message |
| PREEMPTIVE | Anticipatory activation (flinching before impact) | Canvas neuron: fires when image intent detected |
| ON_DEMAND | Voluntary action (deciding to reach for a glass) | Deploy neuron: fires only on explicit request |
Most neurons fire before the user finishes typing. The PREEMPTIVE mode is particularly interesting — the system begins warming image generation caches when it detects image-related intent in the conversation, even if the user has not explicitly asked for an image yet. The Canvas neuron does not wait for "generate an image of..." — it monitors conversation flow for visual intent signals (discussions about design, mentions of "show me," references to colors or layouts) and pre-warms the ComfyUI pipeline. By the time the user explicitly asks for an image, the generation pipeline is already warm. The latency between request and result drops from cold-start (30+ seconds) to warm (5–8 seconds).
The Codegen neuron behaves similarly. When the conversation shifts toward code discussion — imports mentioned, function names referenced, architecture debated — the Codegen neuron begins pre-loading relevant code context and warming the code generation model. If the user then asks "write a function that..." the system is already primed with the relevant codebase context.
ON_DEMAND neurons (Deploy, Execute, FileWrite) are deliberately exempt from preemptive firing. These are effector neurons — they produce side effects in the real world. The system should never deploy a service or write a file because it predicted the user might want to. This is safety by architecture, not safety by instruction. A prompt-based safety system says "do not deploy unless the user confirms." An environmental safety system makes preemptive deployment structurally impossible — the firing mode does not permit it.
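A minimal sketch of what that architectural gate can look like, using the four modes named above (illustrative, not the actual AitherNeurons.py code):

```python
# Sketch of firing-mode gating: effector neurons are structurally excluded
# from preemptive firing, rather than being told not to fire in a prompt.
from enum import Enum, auto


class FiringMode(Enum):
    TIMER = auto()       # background schedule (health checks, memory sync)
    EVENT = auto()       # reacts to bus events (new message, file change)
    PREEMPTIVE = auto()  # anticipatory warm-up when intent signals appear
    ON_DEMAND = auto()   # explicit user request only (deploy, execute, file write)


def may_fire_preemptively(mode: FiringMode) -> bool:
    """Safety by architecture: there is no code path that pre-fires an effector."""
    return mode in (FiringMode.TIMER, FiringMode.EVENT, FiringMode.PREEMPTIVE)


assert may_fire_preemptively(FiringMode.PREEMPTIVE)
assert not may_fire_preemptively(FiringMode.ON_DEMAND)  # deploy/execute stay gated
```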
The environment anticipates. It does not wait for instructions. But it anticipates selectively — perception and preparation, never action.
Matryoshka Representation Learning
In 2022, Aditya Kusupati and colleagues introduced Matryoshka Representation Learning (MRL) at NeurIPS. The core idea is elegant: train embeddings so that the first k dimensions form a meaningful sub-representation for any k.
A traditional 1536-dimensional embedding is all-or-nothing — you use all 1536 dimensions or you lose coherence. An MRL embedding is nested like a matryoshka doll: the first 64 dimensions capture coarse meaning, the first 256 capture moderate detail, the full 1536 capture fine-grained semantics. You can truncate at any point and still have a useful representation.
OpenAI's text-embedding-3 uses MRL. It is becoming the industry standard for embedding models. The reason is practical: you can trade precision for speed and storage by using shorter embeddings for initial retrieval, then full-length embeddings for re-ranking.
The relevance to our argument is not about embeddings per se. It is about hierarchical multi-granularity representations that emerge naturally when the training environment provides multi-scale signal.
MRL works because the training objective forces every prefix of the embedding to be independently useful. The first 64 dimensions must capture enough semantic content to retrieve relevant documents from a corpus of millions. The first 256 must capture enough to distinguish between semantically similar documents. The full 1536 must capture enough for fine-grained similarity comparisons. The environment — specifically, the loss function applied at multiple truncation points during training — shapes the representation into a nested hierarchy. The embedding is not hierarchical because someone designed a hierarchical architecture. It is hierarchical because the environment demanded it.
This has a practical consequence that most engineers miss: you do not need to design the hierarchy. You need to design the training environment — the multi-scale loss function — and the hierarchy emerges. The representation self-organizes to place the most important features first, the increasingly detailed features later. No explicit ordering is specified; it falls out of the training dynamics.
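A minimal sketch of that objective in PyTorch, assuming per-prefix classification heads (the actual MRL formulation also weights prefixes and structures the heads differently):

```python
# Sketch of the Matryoshka training objective (after Kusupati et al. 2022):
# the same task loss is applied at several prefix lengths of the embedding,
# so every prefix is forced to be independently useful. Illustrative only.
import torch
import torch.nn.functional as F

PREFIX_LENGTHS = [64, 256, 1024, 1536]   # the nested "dolls"

def matryoshka_loss(embedding: torch.Tensor,          # (batch, 1536)
                    heads: dict[int, torch.nn.Linear],
                    labels: torch.Tensor) -> torch.Tensor:
    """Sum the task loss over every truncation point of the embedding."""
    total = embedding.new_zeros(())
    for k in PREFIX_LENGTHS:
        logits = heads[k](embedding[:, :k])            # use only the first k dims
        total = total + F.cross_entropy(logits, labels)
    return total

# Toy usage: 8 examples, 10 classes.
emb = torch.randn(8, 1536, requires_grad=True)
heads = {k: torch.nn.Linear(k, 10) for k in PREFIX_LENGTHS}
loss = matryoshka_loss(emb, heads, torch.randint(0, 10, (8,)))
loss.backward()    # gradients reach every prefix, shaping the nested hierarchy
```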
The connection to prompt engineering is direct. The prompt engineering approach says: carefully design a hierarchical prompt with system instructions first, then role definitions, then context, then constraints, then output format. This is manual MRL — the engineer hand-designing the feature hierarchy. The environmental approach says: let the system assemble context hierarchically based on learned importance, placing the most critical information first and progressively adding detail based on the query's requirements. This is automatic MRL — the environment designs the hierarchy.
AitherOS implements the same principle for context, not embeddings. The context pipeline assembles information at multiple granularities:
[AXIOMS] → Immutable system principles (never change)
[IDENTITY] → Agent persona and capabilities
[SPIRIT] → Episodic personality overlays
[RULES] → Behavioral constraints
[CONTEXT] → Session-specific knowledge
[MEMORY] → Recalled past interactions
[NEURONS] → Real-time environmental data
[CONVERSATION] → Current exchange
Each layer is independently meaningful. An agent with only axioms and identity can still function — it just lacks specificity. Add spirit and rules, and it has personality and constraints. Add context and memory, and it has grounding. Add neurons and conversation, and it has full situational awareness.
This is MRL for context. The "dimensions" are layers, and the system can truncate at any point based on the effort level. Low-effort queries (effort 1–2) use the first few layers — axioms, identity, basic context. High-effort queries (effort 7–10) use the full representation — all layers, all memory tiers, graph context, recursive refinement.
The EffortScaler is the truncation function:
| Effort | Context Layers | Token Budget | Reasoning |
|---|---|---|---|
| 1–2 | Axioms + Identity + Basic Context | 1,024–1,536 | Skip |
| 3–4 | + Spirit + Rules + Memory | 2,048 | Gate |
| 5–6 | + Neurons + Full Context | 3,072–4,096 | Light |
| 7–8 | + Graph Context + All Layers | 5,120–6,144 | SASE |
| 9–10 | + Recursive Refinement + Cross-Domain | 7,168–8,192 | SASE + Verify |
The prompt does not specify which layers to include. The environment decides, based on the classified intent and effort level. A simple greeting gets 1,024 tokens of context. A complex architectural question gets 8,192 tokens with graph traversal, memory recall, and recursive refinement. Same user, same interface, same messy typing — radically different context assembly.
This is the MRL insight applied at the systems level: every prefix of the context stack is independently useful. An agent with only the first two layers (Axioms + Identity) can still hold a coherent conversation — it just lacks specificity. An agent with four layers (+ Spirit + Rules) has personality and behavioral constraints. An agent with all eight layers has full situational awareness. The truncation point is not arbitrary — it is learned from the effort classification, which is itself learned from the training signal of past interactions.
The parallel to MRL's truncation is not just structural — it is functional. MRL embeddings trade precision for speed by using fewer dimensions. AitherOS context trades depth for latency by using fewer layers. And just as MRL preserves the most important features in the first few dimensions, AitherOS preserves the most important context in the first few layers. Axioms (immutable principles) always load. Identity (who the agent is) always loads. Everything else is progressive enhancement.
In traditional prompt engineering, this truncation decision falls to the user. "Do I include the full system prompt? Do I add few-shot examples? Do I include the tool schemas? How much context should I add?" In an environment-first system, the truncation is automatic, learned, and query-dependent. The user types whatever they want. The environment decides how deep to go.
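A minimal sketch of that truncation decision, treating the layer stack as a prefix and loosely following the table above (the effort-to-layer mapping and budgets are simplified; this is not the actual EffortScaler code):

```python
# Sketch of MRL-style context truncation: the effort score decides how deep
# the layer stack goes and how large the token budget is. Illustrative only.
CONTEXT_LAYERS = ["AXIOMS", "IDENTITY", "SPIRIT", "RULES",
                  "CONTEXT", "MEMORY", "NEURONS", "CONVERSATION"]

# (max_effort, layers_used, token_budget): every prefix is independently usable.
TRUNCATION_TABLE = [
    (2, 3, 1_536),
    (4, 6, 2_048),
    (6, 7, 4_096),
    (8, 8, 6_144),
    (10, 8, 8_192),
]

def context_plan(effort: int) -> tuple[list[str], int]:
    """Pick how deep the context stack goes for a given effort score."""
    for max_effort, n_layers, budget in TRUNCATION_TABLE:
        if effort <= max_effort:
            return CONTEXT_LAYERS[:n_layers], budget
    return CONTEXT_LAYERS, TRUNCATION_TABLE[-1][2]

print(context_plan(2))   # a greeting: 3 layers, ~1.5k tokens
print(context_plan(9))   # deep research: all 8 layers, 8k tokens
```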
GLAM and Grounded Language Models
In 2023, Thomas Carta and colleagues presented GLAM at ICML — a study on grounding large language models through interactive environment training. The findings were striking: LLMs fine-tuned with environment interaction dramatically outperformed instruction-only fine-tuning on downstream tasks.
The experimental setup was straightforward. Take a base LLM. Fine-tune one copy with high-quality instructions (the standard approach). Fine-tune another copy by letting it interact with an environment — exploring, making mistakes, observing consequences, adapting. Test both on novel tasks.
The environment-trained model won. Not by a small margin — by a large one. And not just on environment-similar tasks — on general reasoning tasks that had nothing to do with the training environment. Grounding in any sufficiently rich environment improved the model's general capabilities.
Why? The result surprised even the authors. The environment-trained model did not just learn environment-specific skills — it developed general cognitive improvements. Models trained by interacting with text-based environments showed better reading comprehension, better logical reasoning, better common-sense inference. The environment was a catalyst for broad capability development, not just domain-specific training.
The authors argue that environment interaction teaches three capabilities that instruction-following alone does not:
- Ambiguity resolution. In an environment, ambiguous instructions have observable consequences. The model learns to disambiguate by predicting outcomes, not by parsing syntax.
- Error recovery. Environments provide feedback. A wrong action produces an observable failure, and the model learns to recognize and recover from mistakes. Instruction-only training does not provide this signal.
- Partial information handling. Real environments are partially observable. The model learns to act under uncertainty, gathering information incrementally rather than requiring complete specifications upfront.
These are exactly the capabilities that prompt engineering tries to compensate for. "Be specific" compensates for poor ambiguity resolution. "Include examples" compensates for lack of grounding. "Break complex tasks into steps" compensates for poor partial-information handling. All three compensations are workarounds for the absence of an environment.
The GLAM results suggest a provocation: prompt engineering is a symptom of environmental poverty. When the environment is rich enough — when the agent can observe, act, and learn from consequences — the need for precise instructions diminishes. The specificity that prompt engineering demands is the specificity that a rich environment provides automatically.
Consider ambiguity resolution. When a user says "fix the auth bug," a model in a bare environment must guess: which auth system? Which bug? What file? A model in a rich environment does not guess — it checks the bug tracker (EventGraph), reads recent commits (CodeGraph), queries the test suite (TestGraph), and resolves the ambiguity from evidence. The "prompt engineering" solution to this ambiguity is to write: "Fix the JWT token validation race condition in auth_middleware.py by adding a lock around the session store read." The "environmental" solution is to let the agent figure it out the same way a senior engineer would — by looking.
AitherOS is a lived GLAM experiment. The training pipeline — DaydreamCorpus for synthetic experience, SessionHarvester for real interaction data, NeuronTrainer for perception fine-tuning — continuously fine-tunes on environmental interaction data. The system's models are not just trained on instructions; they are trained on the consequences of acting within an environment that includes 22 knowledge graphs, 41 neuron types, and 856 tools.
Every interaction generates training signal:
- When the ToolGraph selects 8 tools and the agent uses 3 of them successfully, that is a training example for ToolNanoGPT. The confirmed tools strengthen their categories; the unused tools get neutral signal (not negative — the agent may not have needed them).
- When a neuron fires and its output is included in the context that produces a good response, that is a positive training signal for the NeuronScaler. The neuron's relevance weight increases for similar future queries.
- When a memory recall turns out to be irrelevant and gets pruned by the WEED stage, that is a training signal for the memory retrieval ranker. Future recalls of that memory type get lower priority for similar queries.
- When the EffortScaler classifies a query as effort 3 but the agent needs 5 tool calls and 8 turns to resolve it, that is a calibration signal. The effort model adjusts upward for similar query patterns.
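Each of these signals reduces to a small record appended to a training log. The sketch below is illustrative; the field names and file path are ours, not the actual SessionHarvester schema:

```python
# Hypothetical shape of one accumulated training signal.
import json
import time
from dataclasses import dataclass, asdict


@dataclass
class OutcomeSignal:
    kind: str            # "tool_selection" | "neuron_relevance" | "memory_recall" | "effort_calibration"
    query_intent: str    # classified intent of the user message
    predicted: list[str]
    observed: list[str]
    success: bool
    timestamp: float


def append_signal(signal: OutcomeSignal, path: str = "sessions.jsonl") -> None:
    """Append one signal to the JSONL log that later retraining reads."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(signal)) + "\n")


append_signal(OutcomeSignal(
    kind="tool_selection",
    query_intent="CODE_TASK",
    predicted=["codegraph_search", "testgraph_search", "web_search"],
    observed=["codegraph_search", "testgraph_search"],  # unused tool: neutral, not negative
    success=True,
    timestamp=time.time(),
))
```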
The environment is both the runtime and the curriculum. It does not stop learning. It does not plateau. Every conversation makes the next conversation slightly better — not because the prompts improve, but because the environment accumulates knowledge about how to support any prompt.
AitherOS as Proof of Concept
Theory is cheap. Papers get published, conference talks get applause, blog posts get upvotes — and then everyone goes back to writing system prompts with XML tags.
The question is whether environment-first design actually works at scale. Not in a controlled experiment with curated benchmarks, but in production, with real users typing whatever they want, at whatever level of specificity they feel like providing on a Tuesday afternoon.
AitherOS is 211 microservices across 12 architectural layers, running in over 150 Docker containers. It is not a research prototype. It processes real conversations daily. The messy prompt at the top of this post is not a hypothetical — it is a real prompt from a real session that produced real output. Here is what the environment looks like from the inside.
22 Knowledge Graphs
Every domain the system operates in has a dedicated graph — a structured, queryable representation of that domain's entities, relationships, and semantics. All graphs inherit from BaseFacultyGraph, which provides singleton lifecycle, pickle persistence with HMAC-SHA256 integrity verification, and async sync to the central KnowledgeGraph service via GraphSyncBus.
| Graph | Domain | What It Knows |
|---|---|---|
| CodeGraph | Python source | AST nodes, call graphs, import chains, complexity metrics |
| TypeGraph | TypeScript/React | Components, hooks, types, route definitions |
| ServiceGraph | Microservices | HTTP call chains, dependencies, critical paths, health |
| ConfigGraph | Configuration | YAML/JSON/TOML structure, cross-references, drift |
| APIGraph | Endpoints | FastAPI routes across all 211 services, parameters, auth |
| DocGraph | Documentation | Markdown structure, cross-references, staleness |
| TestGraph | Test coverage | pytest/Pester relationships, fixture dependencies |
| ScriptGraph | PowerShell | 170+ automation scripts, parameters, pipeline stages |
| InfraGraph | Infrastructure | Docker topology, compose dependencies, port mappings |
| EventGraph | Causal events | Temporal DAG with root cause analysis, critical paths |
| FluxGraph | Event bus | Emit/subscribe patterns across all services |
| MemoryGraph | Agent memory | Tags, types, agents — hybrid keyword + semantic query |
| StrataGraph | Storage | Files, lockboxes, artifacts, ownership chains |
| MediaGraph | Media files | Images, PDFs, audio, video with cross-modal search |
| DirectoryGraph | Registry | Agents, users, tenants, services, lockboxes |
| KVCacheGraph | KV cache | Block modeling, prefetch predictions, eviction links |
| RAGAnythingGraph | Documents | Multi-format RAG: PDF, DOCX, PPTX, XLSX, images |
| WikipediaGraph | Knowledge | Articles, entities, categories — external grounding |
| ToolGraph | Tool selection | 3-tier prediction with NanoGPT, hybrid search, MCTS |
| TopicTransitionGraph | Prediction | Markov chains for speculative context pre-computation |
| LogGraph | System logs | Log analysis and pattern detection |
| KnowledgeGraph (service) | Unified | Central aggregation of all faculty graphs |
The agent never asks "search the codebase." The CodeGraph neuron fires automatically when the intent classifier detects a code-related task. The graph is already warm — it re-indexes on file change events, not on query. By the time the LLM needs codebase context, the context is already assembled.
This is the environment-over-instruction principle in concrete form. No prompt needs to specify "use CodeGraph for code tasks." The environment makes that decision.
What makes this powerful is the cross-domain bridging. A question about "why is the auth service slow?" activates CodeGraph (for the auth service code), ServiceGraph (for the dependency chain), EventGraph (for recent incidents), InfraGraph (for the Docker topology), and potentially FluxGraph (for event flow patterns). The answer emerges from the intersection of multiple knowledge domains — an intersection that no prompt could reasonably specify, because the user does not know which domains are relevant until the answer is found.
That shared BaseFacultyGraph infrastructure — singleton lifecycle, HMAC-verified pickle persistence, async fire-and-forget sync via GraphSyncBus — keeps the graphs off the critical path. The sync queue is bounded (max 1,000 entries, overflow dropped), so graph updates never block a request, and a circuit breaker protects against KnowledgeGraph connectivity failures. The infrastructure is invisible to the user. It is visible to the environment.
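To make the persistence half of that contract concrete, here is a minimal sketch of HMAC-verified pickle storage. It is illustrative only: the class name and file layout are ours, and the real BaseFacultyGraph also manages singleton lifecycle and GraphSyncBus sync.

```python
# Sketch of integrity-verified graph persistence: pickle the graph, sign it
# with HMAC-SHA256, and refuse to load anything whose signature fails.
import hashlib
import hmac
import pickle
from pathlib import Path


class FacultyGraphStore:
    def __init__(self, path: Path, secret: bytes):
        self.path = path
        self.secret = secret

    def save(self, graph: dict) -> None:
        payload = pickle.dumps(graph)
        signature = hmac.new(self.secret, payload, hashlib.sha256).digest()
        self.path.write_bytes(signature + payload)       # 32-byte tag, then data

    def load(self) -> dict:
        blob = self.path.read_bytes()
        signature, payload = blob[:32], blob[32:]
        expected = hmac.new(self.secret, payload, hashlib.sha256).digest()
        if not hmac.compare_digest(signature, expected):
            raise ValueError("graph pickle failed integrity verification")
        return pickle.loads(payload)


store = FacultyGraphStore(Path("codegraph.pkl"), secret=b"rotate-me")
store.save({"nodes": ["auth_middleware.py"], "edges": []})
print(store.load())
```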
41 Neuron Types
The neurons are the system's autonomous perception layer. Each neuron type is specialized for a domain, fires under specific conditions, and produces structured output that feeds into the context pipeline.
Perception neurons (input/read): GREP, FILE, DOC, CODE, WEB, MEMORY, TOOL, CLUSTER, MESH, BROWSER_CONTEXT — these observe the environment and gather information.
Graph neurons (structured knowledge): CODEGRAPH, DOCGRAPH, CONFIGGRAPH, SERVICEGRAPH, SCRIPTGRAPH, FLUXGRAPH, APIGRAPH, TESTGRAPH, TYPEGRAPH, INFRAGRAPH — each one queries its corresponding faculty graph.
System neurons (infrastructure awareness): SECRETS, NEXUS, CONVERSATION, SEMANTIC, INDEX, HEALTH, SYSTEM, ARCHITECTURE, GPU, AXIOM — these provide runtime state and system health.
Analytical neurons (reasoning support): DIFF, MERGE, AUDIT, OCR, TENANT_DOCS, SUMMARY, JUDGE — these process and evaluate information.
Effector neurons (action/output): CANVAS, CODEGEN, FILEWRITE, FILEEDIT, EXECUTE, DEPLOY — these produce artifacts and side effects. Critically, effector neurons are ON_DEMAND only. The system never writes to disk or deploys a service without explicit user intent. Safety is architectural.
Meta-cognitive: RLM — recursive language model for deep context analysis with external variable management.
The key insight: most neurons fire before the user finishes typing. TIMER neurons (health, memory sync) run on background schedules. EVENT neurons (conversation, graph updates) trigger on Flux events. PREEMPTIVE neurons (canvas, codegen) activate when intent patterns suggest upcoming needs. Only ON_DEMAND neurons (deploy, execute) wait for explicit requests.
This means the environment is continuously self-updating. When a file changes, the relevant graph neurons re-index. When a conversation message arrives, the memory and conversation neurons fire. When system health shifts, the health neuron updates. The LLM receives context from an environment that is perpetually current — not from a snapshot taken at query time.
For a deeper treatment of the neuron system, see our previous post on autonomous background intelligence.
22-Stage Context Pipeline
The context pipeline is where environment-first design pays its biggest dividend. Instead of asking the user to specify what context the LLM needs, the pipeline assembles it automatically through 22 stages:
Stage 0: CONNECT DAEMON (establish service connectivity)
Stage 0.5: MANIFEST (bandwidth-aware memory index, always loaded)
Stage 1: CLASSIFY (what does this message need?)
Stage 1.5: LOAD ADAPTED WEIGHTS (pull learned weights from EvolutionEngine)
Stage 2: SCALE (determine neuron count and effort budget)
Stage 3: CACHE HIT (check warmth, expire stale entries)
Stage 3.5: FLUX (inject real-time system state; no HTTP, instant)
┌─────────── PARALLEL GATHER BLOCK ───────────┐
│ Stage 4: GATHER (Assembler) │
│ Stage 5: ENRICH (ParallelContextOrch) │
│ Stage 5.5: GRAPH (knowledge graph context) │
│ Stage 5.6: CROSS_DOMAIN (lateral context) │
│ Stage 5.7: MEMORY_BUS (AitherMemoryBus) │
│ Stage 5.7.5: CONV_RECALL (past exchanges) │
│ Stage 5.8: TENANT_DOCS (tenant-specific) │
└─────────────────────────────────────────────┘
Stage 5.9: RECURSIVE REFINEMENT (RLM-inspired quality loop)
Stage 6: INGEST (bridge gathered chunks into cache)
Stage 7: WEED (remove sensitive patterns)
Stage 7.5: QUERY-CONDITIONED RE-SCORING (re-rank by query relevance)
Stage 8: SURGICAL BUDGET (evict lowest-priority chunks, not truncate)
Stage 9: SCORE (quality assessment of final context)
Stage 10: TRACK (record usage for outcome correlation)
Stage 11: X-RAY (capture pipeline snapshot for introspection)
Each stage makes independent decisions. The CLASSIFY stage determines which downstream stages activate. The SCALE stage sets the neuron budget. The PARALLEL GATHER block launches seven context sources simultaneously — will/spirit/affect assembly, neuron enrichment, graph queries, cross-domain lateral context, memory bus queries, conversation recall, and tenant-specific documentation — all via asyncio.gather().
The WEED stage is worth highlighting. It strips sensitive patterns — secrets, API keys, internal URLs — from the assembled context before the LLM sees it. This is a security boundary implemented as a pipeline stage, not a prompt instruction. You cannot prompt-engineer security. You can only build it into the environment.
The SURGICAL BUDGET stage is equally important. When the assembled context exceeds the token budget, this stage does not truncate (which would destroy coherent passages). It evicts the lowest-priority chunks as whole units, preserving the integrity of the remaining context. Priority is determined by source, freshness, query relevance, and historical usefulness — all signals from the environment, not from the prompt.
Two stages deserve special attention for the environment-over-instruction argument.
Stage 7.5 — QUERY-CONDITIONED RE-SCORING — re-ranks all assembled context chunks against the current user query. This is the pipeline's way of asking: "given what the user actually asked, is this context chunk still relevant?" Chunks that scored highly during the GATHER phase (because they matched the topic broadly) may score lower after re-scoring (because they do not match the specific question). This is something prompt engineering cannot do — it is a runtime relevance judgment that requires seeing both the assembled context and the query simultaneously.
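A minimal sketch of how stage 7.5 feeds stage 8: chunks are re-scored against the actual query, then evicted as whole units until the budget is met. The scoring here is deliberately simplified (the real pipeline also weighs source priority, freshness, and historical usefulness):

```python
# Sketch of query-conditioned re-scoring plus surgical budget eviction.
# Chunks are evicted whole, never truncated mid-passage.
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    tokens: int
    gather_score: float      # relevance estimated during GATHER
    query_score: float       # stage 7.5: re-scored against the actual query


def surgical_budget(chunks: list[Chunk], budget: int) -> list[Chunk]:
    """Keep the highest-scoring whole chunks that fit inside the token budget."""
    ranked = sorted(chunks, key=lambda c: c.query_score, reverse=True)
    kept, used = [], 0
    for chunk in ranked:
        if used + chunk.tokens <= budget:
            kept.append(chunk)
            used += chunk.tokens
    return kept        # everything else falls out as a whole unit


chunks = [
    Chunk("auth middleware AST summary", 1200, 0.9, 0.95),
    Chunk("broad wiki article on OAuth", 2500, 0.8, 0.30),   # topical, but not this question
    Chunk("last week's auth conversation", 800, 0.6, 0.85),
]
print([c.text for c in surgical_budget(chunks, budget=2500)])
```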
Stage 11 — X-RAY — captures a full pipeline snapshot for introspection. This is not just logging. The snapshot records which stages activated, which chunks survived, which were evicted, and why. This data feeds back into the training pipeline, allowing future runs to improve their gathering and weeding strategies. The pipeline is self-improving — each run generates training signal about what context was useful and what was noise. Over thousands of runs, the pipeline learns the statistical structure of what makes context valuable for different query types. No prompt can teach this. Only an environment that observes its own performance can learn it.
4-Tier Memory Architecture
Memory is assembled, not retrieved. The pipeline decides what to recall based on intent, effort, and conversation trajectory — not based on a "search your memory" instruction in the prompt.
Tier 1: WorkingMemory — Session-scoped, high-frequency. Current conversation context, active task state, recent tool outputs. Evicted at session end.
Tier 2: Spirit — Episodic and semantic memory. Past interactions, learned preferences, personality-consistent recall. Persists across sessions. The Spirit system loads personality overlays that sit between IDENTITY and RULES in the context hierarchy — shaping how the agent recalls and presents information.
Tier 3: KnowledgeGraph — Structured facts. Entity-relationship triples extracted from conversations, documents, and code. Queryable by graph traversal, not just keyword search. Fed by all 22 faculty graphs via GraphSyncBus.
Tier 4: Nexus — Cross-service vector store. Embeddings from every service — Strata telemetry, Chronicle logs, Flux events, code artifacts — searchable by semantic similarity. This is the long-term memory that connects domains: a conversation about auth bugs recalls not just the conversation, but the relevant code changes, the test results, the deployment logs, and the monitoring alerts.
The user never says "check my memory" or "recall what we discussed about auth." The MEMORY_BUS stage of the pipeline queries all four tiers automatically, ranks results by relevance to the current query, and injects the top matches into context. The prompt is unaware that memory exists. The environment handles it.
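A minimal sketch of that fan-out, with stand-in tier clients in place of the real AitherMemoryBus API:

```python
# Sketch of the MEMORY_BUS fan-out: query all four tiers concurrently,
# merge by relevance, keep the top hits. Illustrative only.
import asyncio


async def query_tier(tier: str, query: str) -> list[tuple[float, str]]:
    """Stand-in for one tier's retrieval; returns (relevance, snippet) pairs."""
    fake = {
        "working":   [(0.90, "current task: auth refactor, LockManager pattern")],
        "spirit":    [(0.75, "user prefers tests written with AuthTestHarness")],
        "knowledge": [(0.60, "(auth_middleware.py) -> depends_on -> (LockManager)")],
        "nexus":     [(0.55, "deploy log: auth service restarted after last fix")],
    }
    await asyncio.sleep(0)          # real tiers are network calls
    return fake.get(tier, [])


async def recall(query: str, top_k: int = 3) -> list[str]:
    tiers = ["working", "spirit", "knowledge", "nexus"]
    results = await asyncio.gather(*(query_tier(t, query) for t in tiers))
    merged = sorted((hit for tier in results for hit in tier), reverse=True)
    return [snippet for _, snippet in merged[:top_k]]


print(asyncio.run(recall("continue with the auth work")))
```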
Consider what this means for the prompt engineering paradigm. In a memoryless system, the user must re-establish context every session: "Last time we discussed the auth middleware refactor. We decided to use the LockManager pattern. Here is the code we wrote..." In a memory-rich environment, the user says "continue with the auth work" and the system recalls the full context automatically — the decision, the code, the rationale, the test results from the last run.
The prompt engineering solution to context re-establishment is to maintain a "system prompt" or "conversation template" that captures ongoing context. This is manual memory — the user doing the system's job. And it is fragile: the template goes stale, the context drifts, the user forgets to update it. Environmental memory is automatic, always current (because it is written as events happen, not retroactively), and queryable (because it is structured, not free-text).
The deepest implication of environmental memory is that the system develops a longitudinal understanding of the user's work. It does not just recall individual conversations — it recalls patterns. A user who asks about auth bugs three times in a month has a pattern. A user who consistently works on the frontend in the morning and the backend in the afternoon has a pattern. These patterns inform context assembly: the system anticipates what context the user will need based on historical patterns, not just the current query.
856 MCP Tools with 3-Tier Selection
Every AI platform eventually hits the tools wall. With 856 registered MCP tools, the naive approach — include all tool schemas in the LLM context — would consume 342,400 tokens (856 tools x ~400 tokens per schema). That is the entire context window, consumed by tool definitions alone.
ToolGraph solves this with a 3-tier selection cascade:
Tier 1: NanoGPT Predictor (<1ms). A tiny character-level model (1 layer, 32-dimensional embeddings, 4 attention heads) trained on past tool selection patterns. For each candidate tool, it evaluates loss = model.evaluate(f"{task}\n{tool_name}") and returns tools sorted by ascending loss. This model retrains automatically every 30 minutes on accumulated session data. After 50 sessions, it reliably predicts tool categories before keyword search even runs.
Tier 2: Hybrid Search (~10–160ms). TF-IDF keyword search over tool names and descriptions (10ms) combined with semantic similarity via embedding cosine distance (150ms when available). Keyword and semantic scores merge at 40/60 weighting. Results are boosted by NanoGPT category predictions (3x) and IntentToolMap routing (1.5x).
Tier 3: Full Schema Fallback. Only when Tiers 1 and 2 produce empty or low-confidence results — return all tools. In practice, this almost never fires after the system has been running for a few days.
The result: 8 tools selected from 856, in under 200 milliseconds, with no user instruction required. Over a 20-turn conversation, this saves approximately 176,000 tokens compared to the naive approach — a 73% reduction in context overhead.
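A minimal sketch of the cascade's control flow, with stand-ins for ToolNanoGPT and the hybrid index (the real tiers are models and search services, not string matching):

```python
# Sketch of the 3-tier selection cascade: narrow with the predictor, score
# with hybrid search, fall back to everything only if both come up empty.

def select_tools(task: str, all_tools: list[str], k: int = 8) -> list[str]:
    # Tier 1: sub-millisecond category prediction narrows the field.
    candidates = nano_predict(task, all_tools)
    # Tier 2: hybrid keyword + semantic search scores the survivors.
    scored = hybrid_search(task, candidates or all_tools)   # [(score, tool), ...]
    if scored:
        return [tool for _, tool in sorted(scored, reverse=True)[:k]]
    # Tier 3: fallback, almost never reached once the predictor has data.
    return all_tools


def nano_predict(task: str, tools: list[str]) -> list[str]:
    """Stand-in for the character-level NanoGPT scorer."""
    return [t for t in tools if any(word in t for word in task.lower().split())]


def hybrid_search(task: str, tools: list[str]) -> list[tuple[float, str]]:
    """Stand-in for 40/60 keyword + semantic scoring with category boosts."""
    return [(1.0 / (i + 1), t) for i, t in enumerate(tools)]


tools = ["codegraph_search", "testgraph_search", "web_search", "canvas_generate"]
print(select_tools("fix the failing testgraph fixture", tools, k=2))
```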
The training loop deserves emphasis because it illustrates the environmental learning principle. ToolNanoGPT records outcomes after every interaction: record_outcome(query, predicted_categories, actual_tools_used, success). It analyzes which category predictions were CONFIRMED (predicted and used), REJECTED (predicted but not used), or DISCOVERED (not predicted but used). These signals accumulate in Library/Training/tool_selection/sessions.jsonl. After 50 positive examples, the base model trains. After 20 examples per category, LoRA adapters specialize for that category. Every 30 minutes, the system checks whether enough new data has accumulated to justify retraining.
This is a closed-loop learning system. The environment observes tool selection outcomes, adjusts the selection model, and produces better selections next time. No prompt engineering can replicate this feedback loop — because prompts do not learn from outcomes. Each prompt starts from scratch. The environment compounds.
The numbers tell the story: before ToolGraph, every LLM call included 30+ tool schemas consuming 12,000+ tokens. After ToolGraph, each call includes 8 schemas consuming ~3,200 tokens. But the interesting number is the accuracy trajectory. ToolNanoGPT starts cold — its initial selections are essentially random among plausible candidates. As it accumulates session data (minimum 50 positive examples before the base model trains, 20 per category for LoRA adapters), its predictions sharpen. The selection quality improves monotonically with usage because the training signal is clean: was the predicted tool actually used? The environment gets better. The prompt does not need to.
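The outcome classification behind that training signal is simple set arithmetic. A minimal sketch with illustrative function names (the real trainer also manages LoRA adapters and the 30-minute retrain check):

```python
# Sketch of CONFIRMED / REJECTED / DISCOVERED classification and the
# retrain gate described above.

def classify_outcome(predicted: set[str], used: set[str]) -> dict[str, set[str]]:
    return {
        "confirmed":  predicted & used,        # predicted and actually used
        "rejected":   predicted - used,        # predicted but never called
        "discovered": used - predicted,        # used but never predicted
    }


def should_retrain(positive_examples: int, minimum: int = 50) -> bool:
    """Base model retrains only after enough confirmed examples accumulate."""
    return positive_examples >= minimum


outcome = classify_outcome(
    predicted={"codegraph_search", "testgraph_search", "web_search"},
    used={"codegraph_search", "testgraph_search", "summarize"},
)
print(outcome)                                  # 2 confirmed, 1 rejected, 1 discovered
print(should_retrain(positive_examples=37))     # False: keep accumulating
```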
For multi-step workflows, MCTSToolRouter extends this with Monte Carlo Tree Search. It explores tool chains — ordered sequences like GATHER (search) → PROCESS (reason) → PRODUCE (generate) → VALIDATE (security) — evaluating them across 8 scoring dimensions including relevance, coverage, complementarity, and chain coherence. Budget: 150 iterations in 500 milliseconds. The LLM receives a curated, ordered tool chain, not an unordered menu.
For the full technical deep-dive on ToolGraph, see our previous post on how ToolGraph makes massive MCP work.
MCTS Planning
The Monte Carlo Tree Search engine deserves its own mention because it embodies the environment-first principle directly. MCTS does not optimize prompts — it optimizes the action space before the LLM encounters it.
The unified MCTS engine runs domain-specific searches:
Tool selection: 150 iterations, 500ms budget. Explores which tools to offer and in what order. The tree encodes execution order — root is empty, each level adds a tool, leaves are complete tool sets. The evaluator rewards forward-flowing chains (search before reasoning before generation before validation).
Intent routing: 200 iterations, 1000ms budget. Explores agent chains — which agents should handle which parts of a complex task. Effort-based depth: effort 1–2 gets 1 agent, effort 8–10 gets a 4-agent chain.
Planning: Domain-agnostic search with a world model that tracks (state, action) → (next_state, reward) transitions. When the world model has enough data, it enables model-based simulation instead of random rollouts — the system simulates tool combinations before the LLM sees any schemas.
The environment simulates before it acts. The LLM receives pre-evaluated options, not raw possibilities.
This is a direct inversion of the prompt engineering approach to tool use. The standard advice: "list the tools the agent should use" or "provide tool schemas in the prompt." The environmental approach: let the system simulate tool combinations, evaluate their expected outcomes, and present only the best chain. The user's prompt does not mention tools. The environment selects them, orders them, and presents them — already evaluated, already ranked.
The MCTS approach has a subtle advantage over prompt-based tool selection: it considers tool interactions. A prompt can list tools, but it cannot easily express that "if you use web_search, you should also use summarize to process the results" or "codegraph_search and testgraph_search complement each other but codegraph_search and canvas_generate do not." The MCTS evaluator scores complementarity (5%), coherence (10%), and chain order (15%) — three dimensions of tool interaction that prompt-based selection ignores entirely.
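A minimal sketch of a weighted chain evaluator: only the three interaction weights quoted above (5%, 10%, 15%) come from the source; the remaining dimensions and their weights are stand-ins for the full 8-dimension scorer:

```python
# Sketch of chain scoring that rewards forward-flowing tool chains
# (search before reasoning before generation before validation).
ORDER = {"GATHER": 0, "PROCESS": 1, "PRODUCE": 2, "VALIDATE": 3}

def chain_order_score(stages: list[str]) -> float:
    """Fraction of adjacent steps that flow forward through the chain."""
    ranks = [ORDER[s] for s in stages]
    forward = sum(1 for a, b in zip(ranks, ranks[1:]) if a <= b)
    return forward / max(len(ranks) - 1, 1)

def evaluate_chain(relevance: float, coverage: float, complementarity: float,
                   coherence: float, stages: list[str]) -> float:
    return (0.40 * relevance                     # stand-in weight
            + 0.30 * coverage                    # stand-in weight
            + 0.05 * complementarity             # quoted: 5%
            + 0.10 * coherence                   # quoted: 10%
            + 0.15 * chain_order_score(stages))  # quoted: 15%

good = evaluate_chain(0.9, 0.8, 0.7, 0.8, ["GATHER", "PROCESS", "PRODUCE", "VALIDATE"])
bad  = evaluate_chain(0.9, 0.8, 0.7, 0.8, ["PRODUCE", "GATHER", "VALIDATE", "PROCESS"])
print(round(good, 3), round(bad, 3))   # the forward-flowing chain scores higher
```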
STTP Dimensional Memory
The Spatio-Temporal Transfer Protocol adds a dimension most systems lack: the system knows HOW it was thinking, not just WHAT it thought.
Every session is tracked along four AVEC dimensions:
- Autonomy: Degree of self-direction versus user guidance in how the agent operates (high = independent, low = user-led)
- Stability: Baseline steadiness versus volatility in decision-making (high = consistent, low = reactive)
- Friction: Resistance versus flow in execution (high = deliberate/cautious, low = rapid/fluid)
- Logic: Structured reasoning versus intuitive association (high = systematic, low = pattern-matching)
These vectors persist across sessions. When a new session begins, STTP calibrates against the last session's AVEC vector, detecting drift. If the agent was highly autonomous and high-friction (deliberate) last session, but the new session requires rapid, low-friction execution, the system adjusts its reasoning posture accordingly.
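A minimal sketch of that calibration step, using the dimension names above and a hypothetical drift threshold:

```python
# Sketch of AVEC drift detection at session start. Illustrative only.
from dataclasses import dataclass


@dataclass
class AvecVector:
    autonomy: float
    stability: float
    friction: float
    logic: float

    def drift(self, other: "AvecVector") -> dict[str, float]:
        return {
            "autonomy":  other.autonomy - self.autonomy,
            "stability": other.stability - self.stability,
            "friction":  other.friction - self.friction,
            "logic":     other.logic - self.logic,
        }


last_session = AvecVector(autonomy=0.8, stability=0.7, friction=0.8, logic=0.6)
this_session = AvecVector(autonomy=0.8, stability=0.7, friction=0.3, logic=0.6)

for dimension, delta in last_session.drift(this_session).items():
    if abs(delta) > 0.3:     # hypothetical drift threshold
        print(f"recalibrate {dimension}: drifted by {delta:+.1f}")
# -> recalibrate friction: deliberate last session, rapid and fluid now
```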
This is environmental context about the agent itself — not about the user's request, but about how the agent has been operating. No prompt can provide this. It requires an environment that observes and records its own cognitive state.
The AVEC dimensions create a form of meta-cognition that prompt engineering cannot replicate. Consider: a prompt can tell an agent to "think carefully" or "be thorough" — but these are static instructions that apply uniformly. STTP provides dynamic meta-cognition: the system knows that it has been drifting toward high-autonomy, low-friction behavior (perhaps because recent tasks were routine), and it can self-correct by increasing friction (deliberateness) when it encounters a novel task. The adjustment is automatic, based on dimensional drift detection, not on a user noticing that the agent seems less careful lately and adding "please double-check your work" to the prompt.
This is the deepest form of environmental intelligence: the environment is not just aware of the world — it is aware of itself operating within the world. It has proprioception. And proprioception, as any athlete or dancer knows, is the foundation of skilled performance.
The Real Example, Revisited
The preceding sections described components in isolation. But the environment-over-instruction thesis is really about how these components compose — how 22 graphs, 41 neurons, 22 pipeline stages, 856 tools, and 4 memory tiers work together to transform a messy prompt into a grounded result.
Let us return to the messy prompt from the opening and trace exactly what happened, stage by stage.
ok so i want a blog post about our philosophy re: prompting — basically
the idea that instead of prompt engineering we build a rich environment
(graphs, neurons, context pipeline, memory, tools) so any messy prompt
works. like jennifer aniston neurons from neuroscience, matryoshka
representation learning, turboquant eviction — reference all the real
stuff. make it ~11k words, match existing blog style. oh and cite the
GLAM paper and quiroga 2005 and the kusupati MRL paper. contrarian
angle — everyone else optimizes prompts, we optimize the environment
the prompt lands in. aitheros is the proof of concept
Step 1: Intent Classification (Stage 1, <5ms). IntentClassifier identifies: intent=CONTENT_CREATION, sub-intent=BLOG_POST, effort=8 (long-form, research-heavy, multiple sources). No keyword matching — the classifier is a trained model that recognizes creation intent from conversational structure.
Step 2: Neuron Scaling (Stage 2, <1ms). EffortScaler at effort 8: fire 15+ neurons, enable SASE reasoning, activate multi-agent orchestration, set output budget to 6,144 tokens, allow up to 25 turns with minimum 3 tool calls.
Step 3: Parallel Gather (Stages 4–5.8, ~2s total). Seven context sources fire simultaneously:
- DocGraph neuron: retrieves existing blog post corpus, identifies tone and formatting conventions
- CodeGraph neuron: indexes all source files mentioned (AitherNeurons.py, ContextPipeline.py, ToolGraph.py, etc.)
- Memory neuron: recalls past conversations about philosophy, architecture, design decisions
- WikipediaGraph: retrieves Quiroga 2005 context, concept cells, medial temporal lobe research
- Web search neuron: fetches Kusupati 2022 (MRL, NeurIPS), Carta 2023 (GLAM, ICML)
- ConvRecall: loads recent conversation context for continuity
- CROSS_DOMAIN: pulls infrastructure context (service counts, Docker topology) for accurate numbers
Step 4: Disambiguation. "jennifer aniston neurons" → Quiroga et al. (2005), Nature 435 — resolved via WikipediaGraph, not keyword matching. "matryoshka representation learning" → Kusupati et al. (2022), NeurIPS — resolved via web search with academic source preference. "turboquant eviction" → KVCacheGraph's graph_eviction_advisor.py — resolved via CodeGraph index. "the GLAM paper" → Carta et al. (2023), ICML — resolved via web search.
Step 5: Verification (Stages 7–9). Every number in the draft is verified against source code. "22 graphs" → grep for BaseFacultyGraph subclasses, count: 22. "41 neurons" → read NeuronType enum, count: 41. "22-stage pipeline" → read ContextPipeline.py stage definitions, count: 22. "856 tools" → query ToolGraph index, count: 856. Numbers that do not match source are flagged and corrected.
Step 6: SURGICAL BUDGET (Stage 8, ~5ms). The gathered context totals approximately 65,000 tokens from all sources combined. The budget for effort 8 is 6,144 output tokens, but the context budget (input to the LLM) is significantly larger. The SURGICAL BUDGET stage evaluates each context chunk on source priority, query relevance (post-re-scoring), and freshness. Low-scoring chunks are evicted as whole units. The system preserves complete code snippets over partial ones, recent memory over stale memory, high-confidence graph results over speculative ones. The final context package is approximately 48,000 tokens — dense, relevant, and grounded.
Step 7: Generation. The LLM receives: system prompt with axioms and identity (~2,000 tokens), the assembled context (~48,000 tokens), and the user's messy prompt (~100 tokens). Of the roughly 50,000 tokens the model processes, 100 came from the user. The rest came from the environment.
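Assuming an OpenAI-style chat message list (an assumption about the transport, not a description of our serving layer), the final assembly is roughly:

```python
# Illustrative final request. Token figures are the approximate numbers above.
system_prompt = "<axioms and identity>"                       # ~2,000 tokens
environment_context = "<~48,000 tokens of grounded context>"  # built by the pipeline
user_prompt = "ok so i want a blog post about our philosophy re: prompting ..."  # ~100 tokens

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "system", "content": environment_context},  # injected by the environment, not the user
    {"role": "user",   "content": user_prompt},
]
```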
This ratio — 100 user tokens to 50,000 environmental tokens — is the thesis of this entire post expressed as a number. The environment contributes 500x more information than the prompt. Optimizing the prompt is optimizing 0.2% of the input. Optimizing the environment is optimizing the other 99.8%.
The counterfactual. Submit the same messy prompt to a vanilla LLM — no knowledge graphs, no neurons, no context pipeline, no tools. The model would hallucinate plausible but wrong numbers. It might say "43 neuron types" (the number from our project plan, not the 41 in the actual code). It would guess at blog style instead of reading existing posts. It would produce generic prose about "the importance of environment" without grounding in actual system architecture. It would cite papers approximately, getting years or venues wrong — maybe placing Kusupati's MRL paper at ICML instead of NeurIPS. The output would be fluent, confident, and ungrounded — the worst kind of AI writing.
You could fix this with a better prompt. You could specify every number, every citation, every style constraint. But then you are doing the agent's job. You are manually providing the environmental knowledge that the system should have independently. And the next time you ask a similar question, you will have to do it all again — because the vanilla LLM has no memory, no graphs, no accumulated knowledge.
The prompt did not produce a good result. The environment did.
The Tradeoff: The Infrastructure Tax
We should be honest about what this approach costs. Intellectual honesty requires acknowledging that environment-first design is not free, not easy, and not appropriate for every situation.
AitherOS is 211 microservices running in over 150 Docker containers across 12 architectural layers. The context pipeline alone involves 22 sequential-and-parallel stages. The neuron system maintains 41 neuron types with four firing modes. The knowledge graph infrastructure indexes source code, documentation, configuration, tests, scripts, events, media, and external knowledge into 22 specialized graphs. The tool selection system trains and runs a NanoGPT model, an embedding index, a TF-IDF scorer, and an MCTS engine — all to select 8 tools from 856.
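To make the selection step less abstract, here is a toy blend of several scorers picking a top-k tool set. The scorer names and weights are invented for illustration; the real selector layers a trained NanoGPT, an embedding index, TF-IDF, and MCTS rather than a fixed weighted sum.

```python
# Toy top-k tool selection from pre-computed per-scorer scores.
def select_tools(scores: dict[str, dict[str, float]], k: int = 8) -> list[str]:
    # `scores` maps tool name -> per-scorer scores already computed for this query.
    weights = {"nanogpt": 0.4, "embedding": 0.3, "tfidf": 0.2, "usage_prior": 0.1}
    def blended(tool: str) -> float:
        return sum(w * scores[tool].get(name, 0.0) for name, w in weights.items())
    return sorted(scores, key=blended, reverse=True)[:k]

scores = {
    "read_file":  {"nanogpt": 0.9, "embedding": 0.8, "tfidf": 0.7, "usage_prior": 0.9},
    "send_email": {"nanogpt": 0.1, "embedding": 0.2, "tfidf": 0.1, "usage_prior": 0.3},
}
print(select_tools(scores, k=1))   # ['read_file']
```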
This is not a weekend project. It is not an npm package you install. It is months of focused infrastructure investment — and it requires ongoing maintenance, monitoring, and evolution. The graphs need re-indexing. The neurons need training data. The pipeline stages need tuning. The tool selection model needs feedback. The memory tiers need garbage collection. There is operational overhead that a simple chat API does not have.
There is also a complexity tax. When something goes wrong in a 22-stage pipeline, debugging is harder than when something goes wrong in a single prompt. The X-RAY stage (Stage 11) exists specifically to address this — capturing full pipeline snapshots for post-mortem analysis — but it adds to the infrastructure burden. More components mean more failure modes. More integration points mean more potential for subtle bugs. We have learned this the hard way, repeatedly.
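For a sense of what those snapshots contain, here is a hypothetical per-stage record; the field names are ours for illustration, not the X-RAY schema.

```python
# Hypothetical per-stage trace record for post-mortem debugging.
import json, time

def snapshot(stage: str, input_tokens: int, output_tokens: int,
             elapsed_ms: float, notes: str = "") -> dict:
    return {
        "stage": stage,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "elapsed_ms": elapsed_ms,
        "notes": notes,
        "ts": time.time(),
    }

trace = [
    snapshot("intent_classification", 100, 100, 4.2),
    snapshot("surgical_budget", 65_000, 48_000, 5.1,
             "evicted ~17k tokens of low-score chunks"),
]
print(json.dumps(trace, indent=2))
```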
For most people, prompt engineering IS their only lever. If your environment is a blank chat window connected to an API, then "be specific" and "provide context" and "use structured prompts" is genuinely the best advice available. The prompt engineering guides are not wrong for the environments they assume.
But the environment they assume — a stateless, toolless, memoryless LLM behind an API — is not the only environment possible. And it is not the environment the industry is building toward.
Look at the trajectory: MCP gives LLMs access to external tools. RAG gives them access to external knowledge. Fine-tuning gives them domain-specific capabilities. Agent frameworks give them multi-step planning. Memory systems give them persistence. Each of these is a step toward a richer environment — and each one reduces the marginal value of prompt optimization.
Plot these developments on a timeline and the trend is unmistakable:
- 2022: LLMs as chat interfaces → only lever: the prompt
- 2023: + tool use, function calling → environment provides action capability
- 2024: + RAG, vector stores, MCP → environment provides knowledge
- 2025: + memory, fine-tuning, grounding → environment provides adaptation
- 2026: + knowledge graphs, neurons, MCTS → environment provides cognition
Each year, the environment gets richer and the prompt gets less important. This is not our prediction — it is the observed trajectory of the entire industry.
The prompt engineering era is early computing. It is hand-coding assembly because compilers do not exist yet. When you program in assembly, every instruction matters. You manually manage registers, manually handle memory, manually structure control flow. The code is fragile — a single wrong instruction crashes the program. Sound familiar? "Write clear, specific instructions. Structure your prompt carefully. One ambiguous phrase will derail the output."
Compilers changed everything. They allowed programmers to express intent at a higher level and let the compiler figure out the implementation details. The assembly still exists — the compiler generates it — but the programmer does not write it. The quality of the generated assembly improved as compilers got better, not as programmers got better at writing assembly.
The environment is the compiler for prompts. It takes a messy intent and generates the detailed, structured context that the LLM actually needs. As the environment improves — more graphs, better neurons, smarter tool selection — the quality of the generated context improves. The user's "assembly" (their raw prompt) becomes progressively less important.
Our prediction: within two years, the most productive AI systems will be the ones with the richest environments, not the most carefully crafted prompts. The competitive advantage will shift from "how well can you prompt" to "how rich is your agent's operational environment." The prompt engineering job title will not disappear — but it will be absorbed into a larger discipline that looks more like systems engineering than creative writing.
The organizations that invest in environmental infrastructure now will have a compounding advantage. Knowledge graphs get richer over time. Neurons get better trained. Context pipelines get more accurate. Tool selection gets more precise. Memory gets deeper. Each interaction improves the environment, which improves the next interaction. The flywheel does not spin for prompt engineering — each prompt is independent, starting from scratch. The flywheel spins for environments.
There is a second-order effect that makes the compounding even stronger: environmental improvements compose. A better CodeGraph makes the CodeGraph neuron more useful, which makes the context pipeline produce better context, which makes the LLM generate better code, which generates better training data for the ToolNanoGPT, which makes future tool selection more accurate. Each component improvement cascades through the entire stack. Prompt improvements do not compose — each prompt stands alone, benefiting only its own execution.
We can illustrate this with a concrete example from our deployment. When we added the KVCacheGraph in early 2026, it did not just improve KV cache management. It provided the context pipeline with new signals about memory pressure, which allowed the SURGICAL BUDGET stage to make better eviction decisions, which reduced context overhead per query, which allowed longer conversations before hitting token limits, which generated more training data per session, which improved ToolNanoGPT accuracy. A single graph addition rippled through five other systems, each one improving slightly. Over a month, the cumulative effect was a measurable improvement in response quality — not because of any prompt change, but because the environment became denser.
This is the fundamental asymmetry: prompt optimization is additive (each improvement helps one query), while environmental optimization is multiplicative (each improvement helps all future queries through all downstream systems). Given finite engineering effort, the environmental investment dominates over time.
The Prompt Is Just the Address
We have been arguing a simple thesis: the intelligence of an AI agent system lives in the infrastructure, not the instruction.
The prompt is the address on an envelope. It tells the system where to go. But the postal system — the sorting facilities, the routing algorithms, the delivery trucks, the address databases — does the actual work. You can write the address in perfect copperplate calligraphy or in barely legible scrawl. If the postal system is good enough, the letter arrives. If the postal system is poor, even perfect calligraphy will not help — because the problem was never the handwriting.
Think about what happened to web search. In 1996, you had to craft precise boolean queries to find anything: "Jennifer Aniston" AND "concept cells" AND neuroscience NOT gossip. Today, you type "jennifer aniston neurons brain" and Google figures it out. The queries did not get better. The search environment — PageRank, knowledge graphs, query understanding, semantic search — got richer. Nobody teaches "search engineering" courses. Nobody optimizes their Google queries with XML tags and chain-of-thought scaffolding. The environment absorbed the complexity that users once had to manage manually.
AI is on the same trajectory. The prompt engineering industry is optimizing handwriting. We are building a better postal system.
This is not a novel insight in other fields. Neuroscience has known it for decades — concept cells emerge from rich perceptual environments, not from explicit instructions. Embodied cognition research has demonstrated that intelligent behavior requires a body situated in an environment, not just a brain processing inputs. The extended mind thesis (Clark and Chalmers, 1998) argues that cognition does not stop at the skull — it extends into the environment, into the tools and structures that the mind uses to think. A notebook is part of your cognitive system when you use it to remember. A calculator is part of your cognitive system when you use it to compute. The environment is not just a context for cognition — it is a component of cognition.
Robotics learned this lesson the hard way. The classical AI approach — sense, plan, act — produced robots that worked in simulation and failed in the real world. Rodney Brooks's subsumption architecture (1986) abandoned explicit planning in favor of layered behaviors that interact with the environment directly. The robot did not plan how to navigate a room. It had simple behaviors (avoid obstacles, seek light, follow walls) that composed in the environment to produce navigation. The intelligence was not in any single behavior. It was in the relationship between behaviors and environment.
AI is catching up to what these fields have known. The LLM is not a brain in a vat, waiting for the perfect prompt to produce intelligent behavior. It is a component in a system, and the system's intelligence emerges from the relationship between the component and its environment. Enrich the environment, and the component becomes more capable — regardless of how precisely you address it.
The 22 knowledge graphs are the agent's accumulated understanding of its world. The 41 neuron types are its autonomous perception. The 22-stage context pipeline is its cognitive architecture. The 856 tools with MCTS selection are its action repertoire. The 4-tier memory is its experience. Together, they form an environment rich enough that a messy, one-line, stream-of-consciousness prompt produces expert-level output.
We are not saying prompts do not matter. We are saying they matter less than the environment they land in — and that the gap between "matters" and "matters less" widens every time the environment improves. The trajectory is clear. The infrastructure is being built. The question is not whether the industry will move toward environment-first design, but how fast.
The best prompt is the one you did not have to write.
References:
- Quiroga, R.Q., Reddy, L., Kreiman, G., Koch, C., & Fried, I. (2005). "Invariant visual representation by single neurons in the human brain." Nature, 435, 1102–1107.
- Kusupati, A., Bhatt, G., Rege, A., Wallingford, M., Sinha, A., Ramanujan, V., Howard-Snyder, W., Chen, K., Kakade, S., Jain, P., & Farhadi, A. (2022). "Matryoshka Representation Learning." NeurIPS 2022.
- Carta, T., Romac, C., Wolf, T., Lamprier, S., Sigaud, O., & Oudeyer, P.-Y. (2023). "Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning." ICML 2023.
- Clark, A. & Chalmers, D.J. (1998). "The Extended Mind." Analysis, 58(1), 7–19.
- Anthropic. (2025). "Building effective agents."
- Brooks, R.A. (1986). "A robust layered control system for a mobile robot." IEEE Journal on Robotics and Automation, 2(1), 14–23.
- Brooks, R.A. (1991). "Intelligence without representation." Artificial Intelligence, 47(1-3), 139–159.