A language model finishes training and stops learning. Everything it "knows" is baked into weights that never move again. Point one at a task it hasn't seen — a customer's oddly-shaped dataset, a game with rules it was never taught — and it does the confident, useless thing: it pattern-matches to the nearest memory and gets stuck.

That gap is the whole problem. Real competence isn't a bigger cache of answers; it's the ability to acquire a new skill on the spot — to look at unfamiliar data, form a model of how it behaves, try something, and get better from the result. Four pieces of our stack have been converging on exactly that, from different directions. This is how they fit together.

1. Learning from new data without retraining

The first instinct for "the model should learn from this data" is to fine-tune. That's slow, needs a GPU, and produces a new frozen artifact. For structured data — tables and time-series, the shape most real businesses actually run on — there's a better move: in-context foundation models.

We integrated Google's TabFM (zero-shot tabular classification and regression) and TimesFM (zero-shot time-series forecasting) as a first-class agent capability. The trick is that "training" here isn't gradient descent at all: you hand the model a support set of labeled examples, and it predicts in a single forward pass. New labeled rows change the answer immediately — no training run, no waiting.

On top of that we built the part that makes it learning rather than just inference: a self-teaching curation loop. When an agent is taught new examples, we don't blindly append them. We build a candidate support set, evaluate it against a held-out split, and promote it only if accuracy doesn't regress — otherwise we roll back and record the failure. Each task's support set is a versioned, outcome-scored memory that grows when growing helps and holds steady when it doesn't. A falling score flags the task for attention. It's the accept-or-rollback discipline of a good CI system, applied to what an agent knows.

agent.tabular_teach(task="lead-scoring", rows=[...], target="converted")
  → candidate = prior_support ∪ new_rows
  → eval on held-out → accept if accuracy holds, else roll back
  → next tabular_classify on "lead-scoring" is already better
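
The accept-or-rollback step above can be sketched in a few lines of Python. This is a minimal illustration, not the service's actual code: `SupportSetStore`, `teach`, and `accuracy` are hypothetical names, the held-out split is assumed to be non-empty, and a one-nearest-neighbor lookup stands in for the in-context model so the example runs without any model weights.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence, Tuple

Row = Tuple[Tuple[float, ...], str]  # (features, label)
Predictor = Callable[[Sequence[Row], Tuple[float, ...]], str]


def nearest_neighbor(support: Sequence[Row], x: Tuple[float, ...]) -> str:
    """Stand-in for the in-context model: label of the closest support row."""
    return min(support, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)))[1]


def accuracy(predict: Predictor, support: Sequence[Row], held_out: Sequence[Row]) -> float:
    """Fraction of held-out rows labeled correctly given this support set."""
    return sum(predict(support, x) == y for x, y in held_out) / len(held_out)


@dataclass
class SupportSetStore:
    """Hypothetical versioned support set for a single task."""
    predict: Predictor
    held_out: List[Row]
    versions: List[List[Row]] = field(default_factory=list)  # accepted sets, oldest first
    rejected: List[Dict] = field(default_factory=list)       # record of rolled-back attempts

    def current(self) -> List[Row]:
        return self.versions[-1] if self.versions else []

    def teach(self, new_rows: Sequence[Row]) -> bool:
        prior = self.current()
        candidate = prior + [row for row in new_rows if row not in prior]
        if not candidate:
            return False  # nothing to evaluate
        baseline = accuracy(self.predict, prior, self.held_out) if prior else 0.0
        score = accuracy(self.predict, candidate, self.held_out)
        if score >= baseline:
            self.versions.append(candidate)  # promote: becomes the current version
            return True
        self.rejected.append({"rows": list(new_rows), "score": score, "baseline": baseline})
        return False  # roll back: current() is unchanged


held_out = [((0.0,), "cold"), ((10.0,), "hot")]
store = SupportSetStore(nearest_neighbor, held_out)
accepted = store.teach([((1.0,), "cold"), ((9.0,), "hot")])  # helps: promoted
rolled_back = not store.teach([((0.5,), "hot")])             # mislabeled row hurts: rejected
```

The second call drops held-out accuracy from 1.0 to 0.5, so the candidate is recorded as a failure and the two-row version stays current.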

This shipped as a running service and a set of agent tools, and we validated it end-to-end in a container on the fleet: the real 1.6-billion-parameter model, loaded and serving over HTTP, returned correct classifications and a time-series forecast that cleanly extrapolated a trend. Zero gradient steps to adapt. One forward pass per prediction.

2. Planning against a world model it builds itself

Adapting to data tells you what things are. Acting well needs the other half: a sense of what happens next. So we took our internal tree-search planner, a Monte Carlo Tree Search (MCTS) engine already proven inside the platform, and ported it into the agent kit as a reusable library behind a clean MCTSEnvironment protocol.

The interesting part is the world model. Search is only as good as its ability to imagine the consequences of a move, and we didn't want to bolt on a heavy learned simulator with its own latency. Instead there's an observed-transition model: a cheap, online memory of "in this state, this action led there." On a hit it replays the exact observed outcome; on a miss it falls back to "assume nothing changed" and flags high uncertainty — which biases the search to explore precisely where it hasn't learned yet. The model of the world sharpens as the agent plays.
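
A minimal sketch of that hit-or-miss behavior, assuming hashable states and actions. The names `ObservedTransitionModel`, `Prediction`, and `exploration_score`, and the 0.0/1.0 uncertainty scale, are illustrative choices rather than the library's API.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, Tuple


@dataclass(frozen=True)
class Prediction:
    next_state: Hashable
    uncertainty: float  # 0.0 = seen before, 1.0 = never seen (assumed scale)


@dataclass
class ObservedTransitionModel:
    """Online memory of 'in this state, this action led there'."""
    table: Dict[Tuple[Hashable, Hashable], Hashable] = field(default_factory=dict)

    def observe(self, state: Hashable, action: Hashable, next_state: Hashable) -> None:
        self.table[(state, action)] = next_state

    def predict(self, state: Hashable, action: Hashable) -> Prediction:
        key = (state, action)
        if key in self.table:
            return Prediction(self.table[key], 0.0)  # hit: replay the observed outcome
        return Prediction(state, 1.0)                # miss: assume nothing changed


def exploration_score(value_estimate: float, prediction: Prediction, bonus: float = 1.0) -> float:
    """Add an uncertainty bonus so search prefers moves it has not learned yet."""
    return value_estimate + bonus * prediction.uncertainty


model = ObservedTransitionModel()
before = model.predict("s0", "right")  # miss
model.observe("s0", "right", "s1")
after = model.predict("s0", "right")   # hit
```

With equal value estimates, the unobserved move scores higher than the observed one, which is the bias toward unexplored dynamics described above.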

A miss isn't a failure. It's the search discovering the edge of what it knows — and steering toward it.

Crucially, this went in without touching the eleven existing consumers of the original engine. The learned-transition and value seams are optional and default-off, so the classic path stays byte-for-byte identical while the new one opts into learning. Old code doesn't pay for the upgrade; new code gets to plan with a model that improves across runs — and that persistence across runs is exactly the "don't relearn what you already figured out" property continuous learning is supposed to give you.
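
One way to picture an optional, default-off seam is a keyword argument that defaults to `None`. In the sketch below only the name `MCTSEnvironment` comes from our library; the method names `legal_actions` and `step`, the `expand` helper, and the `Counter` toy environment are assumptions made for illustration.

```python
from typing import Callable, Dict, Hashable, List, Optional, Protocol

State = Hashable
Action = Hashable
TransitionFn = Callable[[State, Action], State]


class MCTSEnvironment(Protocol):
    """Hypothetical shape of the protocol; the real method names may differ."""
    def legal_actions(self, state: State) -> List[Action]: ...
    def step(self, state: State, action: Action) -> State: ...


def expand(env: MCTSEnvironment, state: State,
           learned_transition: Optional[TransitionFn] = None) -> Dict[Action, State]:
    """Expand one node. Without a learned model, behavior is the classic path."""
    transition = env.step if learned_transition is None else learned_transition
    return {action: transition(state, action) for action in env.legal_actions(state)}


class Counter:
    """Toy environment: the state is an integer, actions add to it."""
    def legal_actions(self, state: int) -> List[int]:
        return [-1, 1]

    def step(self, state: int, action: int) -> int:
        return state + action


classic = expand(Counter(), 0)                                     # existing callers
learned = expand(Counter(), 0, learned_transition=lambda s, a: s)  # opt-in callers
```

Callers that never pass `learned_transition` take exactly the code path they took before, which is the property that lets existing consumers stay untouched.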

3. Configuring itself to whatever's available

An agent that learns and plans is still useless if it can't figure out where it's running. A laptop with a local model is a different world from a cloud tenant with a full reasoning-and-vision fleet. So the newest layer is a self-bootstrap preflight: before it does anything, the agent probes its own environment.

What models can it reach — a local vLLM or Ollama, or the hosted platform? Which roles are covered: orchestration, reasoning, vision, voice? Is the structured-ML teach surface present? It answers those honestly and then builds itself from a spec: one declarative file describes the agent, its prompts, and the capability packs it needs — and a required pack that comes up with zero tools fails loud rather than degrading in silence. Memory and tier preferences shift how it runs, never whether it's allowed to.
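
The probe-then-assemble flow might look roughly like this. It is a sketch under stated assumptions: `PackSpec`, `role_coverage`, `assemble`, and `RequiredPackError` are hypothetical names, the pack and backend names are made up, and the spec is shown as Python objects rather than the declarative file format, which this post does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

ROLES = ("orchestration", "reasoning", "vision", "voice")


@dataclass(frozen=True)
class PackSpec:
    name: str
    required: bool = True


class RequiredPackError(RuntimeError):
    """Raised when a required capability pack loads zero tools."""


def role_coverage(reachable: Dict[str, Sequence[str]]) -> Dict[str, bool]:
    """Map each role to whether any reachable backend can serve it."""
    served = {role for roles in reachable.values() for role in roles}
    return {role: role in served for role in ROLES}


def assemble(packs: Sequence[PackSpec],
             loaders: Dict[str, Callable[[], List[str]]]) -> Dict[str, List[str]]:
    """Load each pack's tools; fail loudly if a required pack comes up empty."""
    tools: Dict[str, List[str]] = {}
    for pack in packs:
        loader = loaders.get(pack.name)
        found = loader() if loader is not None else []
        if not found and pack.required:
            raise RequiredPackError(f"required pack {pack.name!r} came up with zero tools")
        if found:
            tools[pack.name] = found
    return tools


# Hypothetical laptop: one local backend covering two of the four roles.
coverage = role_coverage({"local-ollama": ["orchestration", "reasoning"]})
spec = [PackSpec("structured-ml", required=True), PackSpec("voice", required=False)]
loaders = {"structured-ml": lambda: ["classify", "forecast"], "voice": lambda: []}
tools = assemble(spec, loaders)
```

The optional `voice` pack is simply absent from the result, while an empty required pack raises instead of producing a quietly weaker agent.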

The result is an agent that treats onboarding as a first-class step: report what this machine can do, then assemble the right version of itself for the job in front of it. Same spec, laptop or datacenter — different assembly.

4. Reaching into the real interface — AitherPilot

Learning, planning, and self-assembly all point outward eventually: the agent has to do something in software built for humans. AitherPilot is our computer-and-browser-use pack — it folds the scattered browser and desktop-control surfaces into a single ReAct loop the agent can drive: see the screen, choose an action, take it, observe the result.
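
Stripped of the real browser and desktop plumbing, the see-choose-act-observe cycle reduces to a short loop. The sketch below is not AitherPilot's interface: `Surface`, `react_loop`, and the toy `CounterSurface` are hypothetical stand-ins, and the policy is a fixed function rather than a model call.

```python
from typing import Callable, List, Protocol, Tuple


class Surface(Protocol):
    """Anything the agent can look at and act on (hypothetical interface)."""
    def observe(self) -> str: ...
    def act(self, action: str) -> None: ...


def react_loop(surface: Surface,
               choose: Callable[[str], str],
               is_done: Callable[[str], bool],
               max_steps: int = 10) -> List[Tuple[str, str]]:
    """See the screen, choose an action, take it, then observe again."""
    trace: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        seen = surface.observe()
        if is_done(seen):
            break
        action = choose(seen)
        surface.act(action)
        trace.append((seen, action))
    return trace


class CounterSurface:
    """Toy 'application': a counter with a single increment button."""
    def __init__(self) -> None:
        self.value = 0

    def observe(self) -> str:
        return str(self.value)

    def act(self, action: str) -> None:
        if action == "increment":
            self.value += 1


surface = CounterSurface()
trace = react_loop(surface, choose=lambda seen: "increment", is_done=lambda seen: seen == "3")
```

The `max_steps` cap is a design choice for the sketch: a loop that drives real software needs a bound so a confused policy cannot act forever.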

It's the hands. Paired with the rest of the stack it closes the gap between "the agent decided what to do" and "the agent did it," on real applications rather than a sandboxed API. Premium and license-gated, because computer-use is powerful enough to deserve a clear boundary.

5. The proving ground: ARC-AGI-3

You can't claim an agent learns on the fly by pointing at a benchmark it could have memorized. That's the whole point of the ARC-AGI-3 interactive reasoning benchmark: brand-new games, no instructions, scored on how quickly an agent picks up skills it has never seen. It is built specifically to reward fluid intelligence and punish recall. Which makes it the honest test for everything above.

Look at what a single ARC-AGI-3 game demands, and the four capabilities, plus the learning step that closes the loop, line up almost suspiciously well:

Stage	Capability	What happens
Probe	Bootstrap	Detect available models & roles; assemble the solver for this environment
Perceive	Adapt	Read the grid; form a model of the pieces from a handful of examples, in-context
Search	Plan	MCTS over an observed-transition world model; explore where dynamics are unknown
Move	Act	Take the action — the same ReAct loop that drives real software drives the game
Score	Learn	The outcome sharpens the world model and the support set; the loop tightens

The ARC solver now runs from a declarative agent spec, drives tree-search planning through the new MCTS library, and persists what it observes so a second run doesn't relearn the first run's lessons. That's not four demos stapled together — it's one loop.
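
Persisting observations between runs can be as simple as writing the transition table to disk and reloading it at startup. This sketch assumes string states and actions and a JSON file; `save_transitions` and `load_transitions` are hypothetical helpers, not the solver's actual storage layer.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Tuple

Transitions = Dict[Tuple[str, str], str]  # (state, action) -> next state


def save_transitions(table: Transitions, path: Path) -> None:
    """JSON has no tuple keys, so store each entry as a [state, action, next] row."""
    rows = [[state, action, nxt] for (state, action), nxt in table.items()]
    path.write_text(json.dumps(rows))


def load_transitions(path: Path) -> Transitions:
    """Return the saved table, or an empty one if there was no earlier run."""
    if not path.exists():
        return {}
    return {(state, action): nxt for state, action, nxt in json.loads(path.read_text())}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "transitions.json"
    first_run = load_transitions(path)    # empty: nothing learned yet
    first_run[("s0", "right")] = "s1"     # lesson from the first run
    save_transitions(first_run, path)
    second_run = load_transitions(path)   # starts with that lesson in place
```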

The whole thing: self-bootstrapping, continuous learning

Put plainly: an agent that probes its environment and assembles itself, adapts to new data without a training run, plans against a world model it improves as it goes, acts in real interfaces, and learns from every outcome — with each turn feeding the next. Each piece is useful alone. The point is the circuit: none of them frozen, all of them compounding.

Not a bigger model that knows more. A smaller loop that gets better.

We're holding ourselves to honesty about maturity, so here's the real status — what's been validated live, what's shipped and under test, and what's designed and landing:

Capability	What it does	Status
TabFM / TimesFM teach	In-context structured-data learning; accept-or-rollback support-set curation	Live in container
Agent ML tools	classify / regress / forecast / teach, published in the agent kit	Shipped
MCTS library + world model	Reusable tree search; cheap online observed-transition model	Shipped to the kit
Self-bootstrap / agent-from-spec	Capability probe + declarative assembly; fail-loud packs	Shipped
ARC-AGI-3 solver	Spec-driven solver planning under MCTS, persisting observations	Wired & running
Self-bootstrap preflight	Honest per-machine capability report & onboarding	In design
AitherPilot computer-use	Unified browser/desktop ReAct surface	Pack, gated

The frozen agent is a local maximum. The interesting frontier isn't a model that has memorized more of the world — it's an agent that can walk into a corner of the world it has never seen, and get good at it before your coffee cools. That's the thing we're building, and ARC-AGI-3 is how we keep ourselves honest about whether it works.

Tags: agents, continuous-learning, mcts, world-models, arc-agi, foundation-models, self-bootstrapping

Agents That Learn the Game, Not Just Recall It

July 4, 2026 · 9 min read · Aitherium