We've written before about treating the context window as a scratchpad, not a database — curating the active prompt down to a tight, surgical briefing instead of an encyclopedia dump. That's the right instinct, and it carries you a long way.

But curation has a ceiling, and you hit it the moment the thing you need to reason about is itself enormous.

Picture a two-million-character log file. A question: "How many of these requests returned a 500 after the deploy at 14:02?" You cannot curate your way out of this. To pick the relevant 4,000 tokens, you'd have to already know which lines matter — which is the entire question. Truncate from the top and you might cut the deploy marker. Summarize it first and you've paid to compress information you haven't read, lossily, before you even start. Stuff the whole thing in a frontier model's giant context window and you get a confident answer that's quietly off by 30%, because counting-by-vibes is what language models do when you ask them to count.

The honest move is to admit the model shouldn't read the file at all. It should query it.

That's the idea behind Recursive Language Models, and it's now a first-class cognitive primitive in AitherOS.

The flip

A normal LLM call looks like this:

answer = llm(context + query)

You concatenate the context into the prompt and hope. A Recursive Language Model call looks like this:

answer = rlm(context, query)

The difference is everything: the context is never in the prompt. It lives outside, as a variable. The model is given a REPL — a small command-and-code environment — and told, in effect: "There's a large body of context you cannot see. Here are tools to examine it. Use them to answer the question."

The model never reads the haystack. It navigates it — the way an engineer navigates a codebase too big to hold in their head: grep for the symbol, open the three files that matched, read the function, move on. The paradigm comes from recent research by Zhang, Kraska, and Khattab; we implemented it as a runtime our agents can call.
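To make the flip concrete, here is a minimal sketch of the two prompt shapes. `ContextHandle`, `build_plain_prompt`, and `build_rlm_prompt` are hypothetical names invented for illustration, not the AitherOS API, and the log is synthetic. The only point is what ends up in the prompt in each case.

```python
from dataclasses import dataclass


@dataclass
class ContextHandle:
    """Hypothetical wrapper: the context lives here, outside the prompt."""
    text: str

    def describe(self) -> str:
        # Only metadata about the context is ever placed in the prompt.
        return f"{len(self.text)} chars, {len(self.text.splitlines())} lines"


def build_plain_prompt(context: str, query: str) -> str:
    # Normal call: the whole context is concatenated into the prompt.
    return context + "\n\n" + query


def build_rlm_prompt(handle: ContextHandle, query: str) -> str:
    # RLM-style call: the prompt states the context's size and the available
    # tools, but contains none of the context itself.
    return (
        f"There is a large body of context you cannot see ({handle.describe()}). "
        "Use peek, grep, and execute to examine it.\n"
        f"Question: {query}"
    )


context = "GET /health 200\n" * 125_000  # synthetic 2,000,000-character log
query = "How many requests returned a 500?"
plain_prompt = build_plain_prompt(context, query)
rlm_prompt = build_rlm_prompt(ContextHandle(context), query)
print(len(plain_prompt), len(rlm_prompt))
```

The plain prompt grows with the log; the RLM-style prompt stays a couple of hundred characters no matter how large the context gets.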

The loop

Under the hood it's a turn-by-turn REPL loop. Each turn, the model emits exactly one action. We execute it, hand back the output, and let the model decide its next move. The loop repeats until the model declares a final answer or hits an iteration cap.
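The loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not the production runtime: `run_repl_loop`, the `FINAL:` prefix check, and the cap of 8 turns are all made up for the example (the post does not give the real cap), and the "model" is a scripted stand-in.

```python
from typing import Callable, List, Optional

MAX_TURNS = 8  # assumed value; the real iteration cap is not stated


def run_repl_loop(model: Callable[[List[str]], str],
                  execute_action: Callable[[str], str],
                  max_turns: int = MAX_TURNS) -> Optional[str]:
    """Run one action per turn until the model emits FINAL or the cap is hit."""
    transcript: List[str] = []
    for _ in range(max_turns):
        action = model(transcript)  # exactly one action per turn
        if action.startswith("FINAL:"):
            return action[len("FINAL:"):].strip()
        output = execute_action(action)
        transcript.append(f"> {action}\n{output}")
    return None  # cap reached without a final answer


# Scripted stand-in for a model: replays a fixed list of actions.
script = iter(["peek(0, 20)", "grep('500')", "FINAL: 2 matches"])


def scripted_model(transcript: List[str]) -> str:
    return next(script)


def fake_execute(action: str) -> str:
    return f"[output of {action}]"


answer = run_repl_loop(scripted_model, fake_execute)
stuck = run_repl_loop(lambda t: "peek(0, 1)", fake_execute, max_turns=3)
print(answer, stuck)
```

Note that the loop can return `None`: a model that never declares a final answer simply runs out of turns, which matters later in this post.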

The action vocabulary is small and deliberately filesystem-flavored:

peek(start, length) — view a slice of the context by character or line offset. The model uses this first, to learn the shape of the data without ingesting it.
grep(pattern) — regex search across the context; returns matching lines with line numbers. This is the workhorse.
execute(code) — run Python in a sandbox with the context pre-loaded as context (a string) and context_lines (a list). This is the part that matters most.
llm_query(sub_query) — spin up a recursive sub-model on a slice of the context. More on this below.
partition_map(chunk_size, sub_query) — split the context into chunks, fan a sub-model across all of them in parallel, and merge the results. Map-reduce over text.
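As a rough illustration of three of these actions, here is a sketch in plain Python. The signatures are simplified assumptions: the context is passed explicitly, `peek` works on character offsets only, chunks are measured in lines, the sub-model is any callable that returns a number, and the merge step is a sum. The real runtime may differ on every one of those points.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def peek(context: str, start: int, length: int) -> str:
    """Return a slice of the context by character offset."""
    return context[start:start + length]


def grep(context: str, pattern: str, limit: int = 50) -> List[str]:
    """Regex search; return up to `limit` matching lines with 1-based line numbers."""
    rx = re.compile(pattern)
    hits: List[str] = []
    for number, line in enumerate(context.splitlines(), start=1):
        if rx.search(line):
            hits.append(f"L{number}: {line}")
            if len(hits) >= limit:
                break
    return hits


def partition_map(context: str, chunk_size: int,
                  sub_model: Callable[[str], int]) -> int:
    """Split into chunks of `chunk_size` lines, map in parallel, sum the results."""
    lines = context.splitlines()
    chunks = ["\n".join(lines[i:i + chunk_size])
              for i in range(0, len(lines), chunk_size)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(sub_model, chunks))
    return sum(partials)


# Synthetic JSON-lines sample: every fourth request is a 500.
sample = "\n".join('{"id": %d, "status": %d}' % (i, 500 if i % 4 == 0 else 200)
                   for i in range(20))

head = peek(sample, 0, 10)
matches = grep(sample, r'"status": 500')
total = partition_map(sample, 6, lambda chunk: chunk.count('"status": 500'))
print(head, len(matches), total)
```

Capping `grep` output is what keeps a broad pattern from flooding the prompt with the very haystack the model is supposed to avoid reading.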

Here's a real trace, lightly cleaned up, for the 500-counting question:

> peek(0, 2000)
[shows the first 2000 chars — it's JSON-lines, one request per line]

> grep("\"status\": 500")
[L40213: {... "ts": "14:02:11", "status": 500 ...}]
[L40288: {...} ]
... [50 matches shown, more exist]

> execute:
    import json
    # "ts" is zero-padded HH:MM:SS, so plain string comparison orders correctly
    after = [l for l in context_lines
             if '"status": 500' in l and json.loads(l)["ts"] >= "14:02:00"]
    print(len(after))
1873

> FINAL: 1,873 requests returned a 500 after the 14:02 deploy.

That execute step is the whole game. The model didn't estimate 1,873 — it computed it, over data it never had to fit in its prompt. Counting, filtering, aggregating, joining, date math: all the things language models are famously bad at become a few lines of Python that run against the real bytes. (When the context is genuinely huge — past half a megabyte — the runtime writes it to a temp file inside the sandbox instead of inlining it, so even the sandbox's own memory stays sane.)

This is the difference between an analyst who eyeballs a spreadsheet and one who writes a formula. One of them is reliably correct.
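The inline-versus-temp-file detail mentioned above could look roughly like this. It is a sketch under assumptions: `INLINE_LIMIT`, `build_prelude`, and `run_snippet` are hypothetical names, the exact threshold is a guess at "half a megabyte", and the bare `exec` here is emphatically not a sandbox.

```python
import os
import tempfile

INLINE_LIMIT = 512 * 1024  # assumed reading of "half a megabyte"


def needs_temp_file(context: str) -> bool:
    return len(context.encode("utf-8")) > INLINE_LIMIT


def build_prelude(context: str, workdir: str) -> str:
    """Return Python source that defines `context` and `context_lines`."""
    if not needs_temp_file(context):
        # Small context: inline it as a string literal.
        return f"context = {context!r}\ncontext_lines = context.splitlines()\n"
    # Large context: write it to a file and have the snippet read it back.
    path = os.path.join(workdir, "context.txt")
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(context)
    return (
        f"with open({path!r}, encoding='utf-8', newline='') as f:\n"
        "    context = f.read()\n"
        "context_lines = context.splitlines()\n"
    )


def run_snippet(context: str, code: str) -> dict:
    """Run model-written code against the context. NOT a real sandbox."""
    with tempfile.TemporaryDirectory() as workdir:
        namespace: dict = {}
        exec(build_prelude(context, workdir) + code, namespace)
        return namespace


small = "a\nb\nc"
big = ("x" * 99 + "\n") * 6000  # 600,000 bytes, above the assumed limit
small_n = run_snippet(small, "n = len(context_lines)\n")["n"]
big_n = run_snippet(big, "n = len(context_lines)\n")["n"]
print(small_n, big_n)
```

Either way the model-written code sees the same two names, `context` and `context_lines`, so the model never has to know which path was taken.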

The "recursive" part

llm_query is where the R in RLM earns its name. When the root model says "go classify the labels in lines 0–5,000," it doesn't get a flat sub-call — it spawns an entire new Recursive Language Model one level deeper, with its own REPL over its own slice. That sub-model peeks, greps, and computes independently, then returns a single distilled answer. The root model never sees the 5,000 lines. It sees one clean result.

This recurses, with a depth limit (we cap it at three), bottoming out in a direct answer at the leaves. It's divide-and-conquer over context: a tree of small, focused reasoning jobs, each one looking at a piece small enough to actually understand, the results bubbling up to a root that only ever holds summaries. A frontier model with a million-token window holds the whole haystack in attention and pays for it in both dollars and accuracy. A recursive model holds almost nothing at any given node — and stays sharp because of it.
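The shape of that recursion can be shown without any model at all. In this sketch the sub-model is replaced by a deterministic function that either answers directly or splits its slice in half and delegates. The halving strategy, `LEAF_LINES`, and the name `rlm_count` are assumptions made for illustration; only the depth cap of three comes from the text.

```python
from typing import Callable, List, Optional, Tuple

MAX_DEPTH = 3   # the depth cap stated above
LEAF_LINES = 4  # assumed: slices this small are answered directly


def rlm_count(lines: List[str],
              predicate: Callable[[str], bool],
              depth: int = 0,
              trace: Optional[List[Tuple[int, int]]] = None) -> int:
    """Count matching lines by divide-and-conquer with a depth limit."""
    if trace is not None:
        trace.append((depth, len(lines)))  # record (depth, slice size) per node
    if depth >= MAX_DEPTH or len(lines) <= LEAF_LINES:
        # Leaf: answer directly over the slice this node owns.
        return sum(1 for line in lines if predicate(line))
    mid = len(lines) // 2
    # Each child sees only its own slice; the parent sees only two numbers.
    return (rlm_count(lines[:mid], predicate, depth + 1, trace)
            + rlm_count(lines[mid:], predicate, depth + 1, trace))


lines = ["ok"] * 60 + ["err"] * 4  # synthetic data
trace: List[Tuple[int, int]] = []
total = rlm_count(lines, lambda line: line == "err", trace=trace)
deepest = max(depth for depth, _ in trace)
print(total, deepest, len(trace))
```

With 64 lines the tree has 15 nodes and stops at depth 3, where each leaf handles 8 lines directly rather than recursing further.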

Two details we learned the hard way:

Every REPL turn is a lean call. Early on, each turn re-ran our full context-assembly pipeline (graph queries, memory recall, the works), which meant every peek cost 10–18 seconds of unrelated retrieval. That's absurd: the model already has its context available through the REPL; it doesn't need us to assemble more. We stripped the heavy pipeline off the inner loop. Each turn is now just the REPL state and the model.

Everything is traced. Each step — the action, its arguments, the output preview, timing, token cost — is recorded as a structured, fully interpretable trace. You can replay exactly what the model peeked at, what it grepped, what code it ran, and what came back. The good discoveries get persisted into our knowledge graph so the next query can stand on them. An RLM run isn't a black box that emits an answer; it's an auditable transcript of an investigation.
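A trace step with the fields listed above might be recorded like this. `TraceStep` and `record_step` are hypothetical names, and the 80-character preview length is an arbitrary choice for the sketch; the real trace schema is not shown in this post.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Any, Callable, Dict, List


@dataclass
class TraceStep:
    action: str
    args: Dict[str, Any]
    output_preview: str
    elapsed_ms: float
    tokens: int


def record_step(trace: List[TraceStep], action: str, args: Dict[str, Any],
                fn: Callable[..., Any], tokens: int = 0,
                preview_chars: int = 80) -> Any:
    """Run one action, time it, and append a structured record to the trace."""
    start = time.perf_counter()
    output = fn(**args)
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Store only a preview, so the trace itself stays small.
    trace.append(TraceStep(action, args, str(output)[:preview_chars],
                           elapsed_ms, tokens))
    return output


trace: List[TraceStep] = []
text = "hello world"
out = record_step(trace, "peek", {"start": 0, "length": 5},
                  lambda start, length: text[start:start + length])
serialized = json.dumps([asdict(step) for step in trace])
print(serialized)
```

Because each step is plain data, the whole run serializes to JSON and can be replayed or inspected after the fact.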

The part where we broke it in production

Here's the chapter most write-ups skip.

The first time we wired this into the live chat path, we did the obvious thing: we made the Recursive Language Model a terminal answer-producer. If a request was heavy enough, the REPL ran, and whatever it returned became the reply. Clean. Direct. Done.

It was a mistake, and it failed in the nastiest possible way.

A REPL loop doesn't always converge. Sometimes it runs out of iterations mid-investigation. Sometimes a sub-call errors. Sometimes the model burns its whole budget thinking and emits nothing. And because we'd wired the REPL's output directly to the answer, those non-answers — an empty string, an [ERROR] marker, a half-finished thought — got surfaced to the user as the response. The single worst outcome for a system whose entire pitch is "surgical, reliable context": confidently shipping a blank.

So we pulled it. The terminal path is disabled to this day, on purpose, and the code carries a comment explaining why so nobody re-enables it without reading the story first.

Then we rebuilt it the right way, around a principle we should have started with: a tool may produce material, but a tool must never own the final word.

In the rebuilt design, the Recursive Language Model is not an answer-producer. It's a distillation tool the orchestrator calls on demand. When the agent loop runs into a large blob of context, it invokes the RLM to compress that blob down to the relevant slice — and then feeds that slice into the reasoning model, which decides what to actually say. The REPL informs the answer; it never is the answer.

And we put a structural guarantee underneath it, which we call crystallization:

As the agent works, every durable conclusion it reaches is written to a findings ledger — a small, deduplicated record of what's been established, separate from the raw, messy turn-by-turn history.
When context gets long and we compact it, the ledger is re-injected as an [ESTABLISHED …] digest, so verified conclusions survive the compaction that sheds the noise.
If the loop is ever about to return empty, a forced synthesis kicks in — thinking turned off, tools stripped away — that reads the ledger and writes a real, grounded answer from what's already known.
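The first two pieces, the ledger and its re-injection on compaction, can be sketched as follows. `FindingsLedger` and `compact` are hypothetical names, deduplication here is a simple case-and-whitespace normalization, and the digest layout is invented because the real format is elided above. The findings are synthetic.

```python
from typing import List


class FindingsLedger:
    """Small, deduplicated record of established conclusions (hypothetical API)."""

    def __init__(self) -> None:
        self._findings: List[str] = []
        self._seen = set()

    def add(self, finding: str) -> bool:
        # Normalize case and whitespace so trivial rewordings deduplicate.
        key = " ".join(finding.lower().split())
        if not key or key in self._seen:
            return False
        self._seen.add(key)
        self._findings.append(finding.strip())
        return True

    def digest(self) -> str:
        # The real digest format is not given; this layout is made up.
        return "[ESTABLISHED] " + " | ".join(self._findings)


def compact(history: List[str], ledger: FindingsLedger,
            keep_last: int = 2) -> List[str]:
    """Keep only the newest turns, with the ledger digest re-injected up front."""
    recent = history[-keep_last:] if keep_last > 0 else []
    return [ledger.digest()] + recent


ledger = FindingsLedger()
first = ledger.add("Log is JSON Lines, one request per line")
duplicate = ledger.add("log is JSON lines,  one request per line")
ledger.add("Deploy marker is present in the log")
history = [f"turn {i}: raw output" for i in range(10)]
compacted = compact(history, ledger)
print(compacted)
```

Compaction drops eight of the ten raw turns here, but both established findings survive because they live in the ledger rather than in the history.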

Stack those together and the old failure becomes impossible by construction. The orchestrator's synthesis always owns the final reply, and it always has the crystallized ledger to draw on. An empty or errored REPL result has nowhere to go — it's one input among many, not the output. We verified the recovery path with a test that simulates exactly the original failure (a loop whose every turn returns empty) and asserts a real answer still comes out. It does.
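A toy version of that recovery test, under heavy assumptions: `is_non_answer`, `forced_synthesis`, and `agent_reply` are names invented for this sketch, and the "synthesis" is string formatting rather than a model call with thinking off and tools stripped. What it does preserve is the structure: the loop's output is only ever a candidate, and the final reply is checked before it leaves.

```python
from typing import Callable, List, Optional


def is_non_answer(text: Optional[str]) -> bool:
    """Empty strings, missing values, and error markers are not answers."""
    if text is None:
        return True
    stripped = text.strip()
    return stripped == "" or stripped.startswith("[ERROR]")


def forced_synthesis(findings: List[str]) -> str:
    # Stand-in for a tool-free, no-thinking model call over the ledger.
    if not findings:
        return "I could not establish an answer from the available context."
    return "Based on what was established: " + "; ".join(findings) + "."


def agent_reply(turns: List[Callable[[], Optional[str]]],
                findings: List[str]) -> str:
    reply: Optional[str] = None
    for turn in turns:
        candidate = turn()
        if not is_non_answer(candidate):
            reply = candidate
            break
    # The guard: a non-answer never reaches the user.
    if is_non_answer(reply):
        reply = forced_synthesis(findings)
    return reply


# Simulate the original failure: every turn returns empty or an error marker.
broken_turns = [lambda: "", lambda: "[ERROR] sub-call failed", lambda: None]
findings = ["the log is JSON Lines", "a deploy marker exists in the log"]
recovered = agent_reply(broken_turns, findings)
healthy = agent_reply([lambda: "", lambda: "42 requests"], findings)
print(recovered)
```

Even with an empty ledger the function returns an explicit statement that nothing was established, which is still better than shipping a blank.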

The lesson generalizes well beyond this one feature. Powerful, exploratory components — REPLs, recursive calls, tool chains — are generative, and generative things sometimes generate nothing. The architecture's job is to make sure "nothing" can never reach the user as if it were "something." You don't get there by making the component more reliable. You get there by never letting it hold the pen.

Why this is the natural next step

There's a tidy progression here, and it's the through-line of how we think about context.

First you learn the context window is not a database — you stop using it as durable storage and move persistence into real infrastructure. Then you learn it's a curation problem — you stop dumping everything in and surgically assemble a tight briefing for each request. Both of those are about getting the right small thing into the prompt.

Recursive Language Models are the move you make when the right thing isn't small. When the relevant context is a 2M-character log, a sprawling codebase, a thousand-row export — you can't curate it down without already knowing the answer. So you stop trying to move the data to the model, and you give the model tools to reach the data where it lives. peek, grep, execute, recurse. The prompt holds the question and the findings, never the haystack.

The context window is an extraordinary reasoning instrument. It is a terrible place to put a database, a mediocre place to put an encyclopedia, and exactly the wrong place to put two million characters you were hoping the model would count correctly. Treat it as the last mile — the place where a curated question meets distilled, verified findings — and give the model real tools for everything upstream of it.

Don't read the context. Query it. And never let the query own the answer.


Tags: engineering, architecture, cognition, llm, agents, context

Don't Read the Context. Query It.

June 5, 2026 · 12 min read · Aitherium