An agent should get better at a task the second time it does it. Ours didn't. It would drive a browser, authenticate as the owner, discover a SaaS app's data, mirror it — and then, on the next run, do the entire discovery dance again from zero. Smart in the moment, amnesiac across moments.

This is the story of fixing that properly — turning a one-off "extraction recipe" into a platform-wide continual-learning capability where a procedure learned once becomes a scored memory, a real skill, a callable tool, an A2A capability, and a distributable agent pack. It's also a story about three self-inflicted wounds we had to own along the way, because the interesting part of engineering is rarely the happy path.

Part 1 — "It produced a shitty HTML page, is all?"

The agent's job was "rebrand and migrate" a target app: log in, mirror the data, redesign it, deploy a new version. The mirror step authenticated correctly but only ever pulled four hard-coded API paths. On a single-page app, that returns the landing page and not much else. The deliverable was thin.

So we replaced the hard-coded paths with a reasoning+vision browser navigator that drives the page like a person — observe the DOM, reason about the next action, click, capture the XHRs the app fires, repeat — and remembers the recipe it discovered so the next run can replay it. On the target app (garg.aitherium.com) it went from a landing blurb to 112 records: people, projects, conversations, documents.

Better. Not done.

Part 2 — "There are more people records than that."

The export said people: 1. The owner knew there were seventeen. We'd captured the right endpoint — /api/people — but it returns an envelope: { "company_staff": [ …17 people… ], "workspace_users": [], "counts": {…} }. We were counting the envelope as one record instead of exploding the nested collection. One _explode_records function later: 1 → 17.
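A minimal sketch of the envelope fix, assuming the simplest rule that fits the response above: every top-level key holding a list of objects is its own collection, and everything else is metadata. The name `explode_records` and the `"items"` fallback key are illustrative and not the production `_explode_records`.

```python
from typing import Any


def explode_records(payload: Any) -> dict[str, list[dict]]:
    """Split a response body into named record collections.

    A bare list becomes one collection called "items". A dict envelope
    contributes one collection per key whose value is a list of dicts.
    Keys holding scalars or plain dicts (such as "counts") are skipped.
    """
    if isinstance(payload, list):
        return {"items": [r for r in payload if isinstance(r, dict)]}
    if not isinstance(payload, dict):
        return {}
    collections: dict[str, list[dict]] = {}
    for key, value in payload.items():
        if isinstance(value, list) and all(isinstance(r, dict) for r in value):
            collections[key] = value
    if not collections:
        # No nested lists at all: the dict itself is the single record.
        return {"items": [payload]}
    return collections


# Same shape as the /api/people envelope, with placeholder rows.
envelope = {
    "company_staff": [{"id": i} for i in range(17)],
    "workspace_users": [],
    "counts": {"company_staff": 17},
}
collections = explode_records(envelope)
total = sum(len(records) for records in collections.values())
```

Counting `total` instead of the number of responses is what moves the figure from 1 to 17 here.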

Then the deeper question: "Are we even getting all the pages in the app?" We were getting 2 of 17 panels. The recipe was chat-and-people shaped because that's where the navigator happened to wander.

The fix was to stop guessing and ask the app what it exposes. Most of these backends are FastAPI, and FastAPI publishes a complete OpenAPI spec. Enumerate every GET collection endpoint, sweep them in batched in-session fetches, harvest IDs to fill the /x/{id} detail routes. Result, measured live:

Entity types: 9 → 195. Records: 683, across 250 endpoints. Users, tools, contacts, projects, staff, documents, invoices, audit, forms, analytics, calendar — the whole surface.
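The enumeration step can be sketched against the standard OpenAPI `paths` object. This is an illustration under assumptions, not the shipped sweep: the helper names (`plan_sweep`, `fill_detail_routes`, `batched`) are hypothetical, only trailing single-parameter routes are filled, and the actual in-session fetching is left out.

```python
import re


def plan_sweep(spec: dict) -> tuple[list[str], list[str]]:
    """Return (collection_paths, detail_templates) for every GET route."""
    collections, details = [], []
    for path, operations in spec.get("paths", {}).items():
        if "get" not in operations:
            continue  # skip routes that only accept writes
        (details if "{" in path else collections).append(path)
    return sorted(collections), sorted(details)


def fill_detail_routes(templates: list[str], harvested_ids: dict) -> list[str]:
    """Expand templates such as /api/projects/{id} using harvested ids.

    harvested_ids maps a collection path to the ids seen in its response.
    """
    urls = []
    for template in templates:
        params = re.findall(r"\{[^}]+\}", template)
        if len(params) != 1 or not template.endswith(params[0]):
            continue  # this sketch handles trailing single-parameter routes only
        prefix = template[: -len(params[0])].rstrip("/")
        for record_id in harvested_ids.get(prefix, []):
            urls.append(f"{prefix}/{record_id}")
    return urls


def batched(items: list, size: int):
    """Yield fixed-size slices so fetches can be issued batch by batch."""
    for start in range(0, len(items), size):
        yield items[start : start + size]


spec = {
    "paths": {
        "/api/people": {"get": {}},
        "/api/projects": {"get": {}, "post": {}},
        "/api/projects/{id}": {"get": {}, "delete": {}},
        "/api/login": {"post": {}},
    }
}
collection_paths, detail_templates = plan_sweep(spec)
detail_urls = fill_detail_routes(detail_templates, {"/api/projects": [7, 9]})
```

Keeping the planning pure (spec in, URL lists out) means it can be checked without a browser session; only the fetch layer needs the authenticated context.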

Part 3 — three mistakes worth admitting

Mistake one: a fallback that faked working. When recipe recall missed, we fell back to a fuzzy semantic search that could return a recipe — just not necessarily the right one. That's worse than failing: it hides the miss and looks healthy. We deleted it. Recall now has exactly one path: fetch by stable id, or return an honest None and rediscover. No silent stand-ins.
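In code, the single recall path is small. The sketch below assumes an in-memory store; `RecipeStore` and `recall_or_rediscover` are hypothetical names used only for illustration.

```python
from typing import Callable, Optional


class RecipeStore:
    """Hypothetical in-memory store keyed by a stable recipe id."""

    def __init__(self) -> None:
        self._recipes: dict[str, dict] = {}

    def save(self, recipe_id: str, recipe: dict) -> None:
        self._recipes[recipe_id] = recipe

    def recall(self, recipe_id: str) -> Optional[dict]:
        # Exact lookup only: no similarity search, no nearest match.
        return self._recipes.get(recipe_id)


def recall_or_rediscover(
    store: RecipeStore, recipe_id: str, discover: Callable[[], dict]
) -> tuple[dict, str]:
    """Replay a stored recipe, or rediscover and store it on an honest miss."""
    recipe = store.recall(recipe_id)
    if recipe is not None:
        return recipe, "replayed"
    recipe = discover()
    store.save(recipe_id, recipe)
    return recipe, "rediscovered"


store = RecipeStore()
first = recall_or_rediscover(store, "extract:app-a", lambda: {"steps": ["GET /api/people"]})
second = recall_or_rediscover(store, "extract:app-a", lambda: {"steps": []})
miss = store.recall("extract:app-b")
```

The caller always knows which branch ran, so a miss shows up in logs and metrics instead of hiding behind a plausible-looking recipe.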

Mistake two: we corrupted a memory by writing during a read. To migrate old recipes we had recall re-save them in the new format on the way out. A write-during-read raced and emptied the stored memory to a zero-length string, which then poisoned every future upsert on that key. The lesson is old and we relearned it the hard way: reads don't mutate. The store now adapts legacy data in memory and never writes during a recall.
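A sketch of what "reads don't mutate" can look like, assuming a hypothetical version field and a made-up legacy layout (a flat `paths` list); the real recipe formats are not shown in this post.

```python
import copy
from typing import Optional

CURRENT_VERSION = 2


def adapt_legacy(stored: dict) -> dict:
    """Return a current-format copy; the stored object is left untouched."""
    recipe = copy.deepcopy(stored)
    if recipe.get("version", 1) < CURRENT_VERSION:
        # Hypothetical v1 layout: a flat list of URL strings under "paths".
        recipe["steps"] = [
            {"action": "fetch", "url": url} for url in recipe.pop("paths", [])
        ]
        recipe["version"] = CURRENT_VERSION
    return recipe


class ReadOnlyRecallStore:
    """Recall adapts in memory; only save() ever touches the backing dict."""

    def __init__(self, backing: dict) -> None:
        self._backing = backing
        self.writes = 0

    def save(self, key: str, recipe: dict) -> None:
        self.writes += 1
        self._backing[key] = recipe

    def recall(self, key: str) -> Optional[dict]:
        stored = self._backing.get(key)
        return None if stored is None else adapt_legacy(stored)


backing = {"extract:app-a": {"version": 1, "paths": ["/api/people"]}}
store = ReadOnlyRecallStore(backing)
recalled = store.recall("extract:app-a")
```

If legacy entries should eventually be rewritten, that can happen in an explicit, separately scheduled migration rather than as a side effect of a read.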

Mistake three: we locked ourselves out of our own API. We'd (correctly) noticed the apps were serving their full /openapi.json schema to anyone, unauthenticated, and gated it behind a real session. Then our own acquisition started getting 401s and silently degrading — because its minted browser session wasn't a validated app session. The right fix wasn't to punch a hole in the gate; it was to make the agent authenticate (dev-login) and pass through it like any legitimate owner. Lock the door and carry a key.

Part 4 — the real point: learning that converges

A recipe you save but never trust is theater. The unlock was a stable identity plus outcome scoring: every learned procedure lives under a deterministic key, gets upserted in place (no duplicate pile-ups), and carries a score that moves with reality — reinforced when a run succeeds, decayed when it doesn't, on top of the memory service's existing strength/decay tiers. Recall fetches by that key, so a large, grown procedure can't get out-ranked and dropped by fuzzy similarity. The recipe converges: explore until the surface is covered, then fast-path.
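As a rough sketch of stable identity plus outcome scoring: the key derivation, the starting score of 0.5, and the reinforce/decay constants below are assumptions chosen for illustration, and the memory service's own strength/decay tiers are not modeled.

```python
import hashlib


def procedure_key(kind: str, target: str) -> str:
    """Same (kind, target) always yields the same key, so saves upsert in place."""
    digest = hashlib.sha256(f"{kind}:{target.lower()}".encode()).hexdigest()
    return f"{kind}:{digest[:16]}"


class ScoredStore:
    """Hypothetical store: one entry per key, with a score that tracks outcomes."""

    def __init__(self, reinforce: float = 0.2, decay: float = 0.5) -> None:
        self.entries: dict[str, dict] = {}
        self.reinforce = reinforce
        self.decay = decay

    def upsert(self, key: str, steps: list) -> dict:
        # Create the entry once, then replace its steps on every later save.
        entry = self.entries.setdefault(
            key, {"score": 0.5, "runs": 0, "successes": 0}
        )
        entry["steps"] = steps
        return entry

    def record_outcome(self, key: str, success: bool) -> float:
        entry = self.entries[key]
        entry["runs"] += 1
        if success:
            entry["successes"] += 1
            # Move a fixed fraction of the remaining distance toward 1.0.
            entry["score"] += self.reinforce * (1.0 - entry["score"])
        else:
            # Failures shrink the score multiplicatively.
            entry["score"] *= self.decay
        return entry["score"]


store = ScoredStore()
key = procedure_key("extract", "garg.aitherium.com")
store.upsert(key, ["GET /api/people"])
store.upsert(key, ["GET /api/people", "GET /api/projects"])  # same entry, grown
after_success = store.record_outcome(key, success=True)
after_failure = store.record_outcome(key, success=False)
```

Because recall goes through `procedure_key` rather than similarity, the entry's size or wording has no effect on whether it is found.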

Once a procedure is real and scored, it can graduate:

Promote → Skill + Tool. Cross a quality gate (success rate and run count) and the procedure registers as a discoverable skill and a callable tool whose handler replays its steps.
Package → Agent Pack. Emit a self-contained .toolpack.yaml that embeds the procedures, plus A2A skill descriptors so a running agent advertises the learned capability and other agents can call it.
Bias the planner. Winning plans are cached as procedures, and their outcomes feed the MCTS value-oracle's priors — so the planner is steered toward proven tool sequences. We deliberately do not blind-replay a cached plan; tools and APIs drift, and replaying a stale plan is just a slower way to be wrong.
Feed evolution. Proven procedures are harvested into the training corpus, so the nightly model-evolution loop learns to prefer what already works.
Close the loop in the SDK. Agents now recall relevant skills before acting and extract + save a new skill after a successful multi-tool run — reinforcing the ones they reused.
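The promotion gate and the planner bias from the list above can be sketched as two small functions. The thresholds (5 runs, 80% success) and the smoothing constants are invented for illustration, and the smoothed success rate is one simple way to turn outcomes into a prior, not necessarily how the MCTS value oracle consumes them.

```python
from dataclasses import dataclass


@dataclass
class ProcedureStats:
    runs: int
    successes: int

    @property
    def success_rate(self) -> float:
        return self.successes / self.runs if self.runs else 0.0


def should_promote(
    stats: ProcedureStats, min_runs: int = 5, min_success_rate: float = 0.8
) -> bool:
    """Quality gate: enough runs AND a high enough success rate."""
    return stats.runs >= min_runs and stats.success_rate >= min_success_rate


def planner_prior(
    stats: ProcedureStats, pseudo_runs: float = 2.0, base: float = 0.5
) -> float:
    """Smoothed success rate.

    With few runs the value stays near `base`; with many runs it approaches
    the observed rate. It nudges the planner and never forces a replay.
    """
    return (stats.successes + base * pseudo_runs) / (stats.runs + pseudo_runs)


proven = ProcedureStats(runs=10, successes=9)
lucky = ProcedureStats(runs=2, successes=2)
flaky = ProcedureStats(runs=10, successes=5)
```

Requiring both conditions keeps a procedure that succeeded twice out of two from graduating on luck alone, while the prior still gives it a mild head start in planning.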

The satisfying part: ~80% of the pieces already existed — three separate, disconnected learning loops and all the delivery rails (skills, tools, A2A, packs). The work was mostly unification, built behind feature flags as a strangler so the existing loops kept running while everything moved onto one substrate.

What it adds up to

An agent drives an app it's never seen, authenticates, enumerates the real API surface, mirrors everything, and writes a scored recipe. Run it again and it replays in seconds. Let the recipe prove itself a few times and it becomes a skill you can call, a tool other agents can invoke, a pack you can ship, and a prior that makes the planner smarter. Outcomes flow back as reinforcement; the winners get trained into the models.

That's the difference between an agent that's clever once and a system that gets better every time it's used. Ours stopped starting from scratch.

Honest footnote: the substrate is flag-gated and shipped to develop with 44 tests. The mistakes above are in the commit history on purpose — the postmortem is the product.

Tags: engineering, agents, continual-learning, memory, mcts, architecture

The Day Our Agents Stopped Starting From Scratch

June 4, 2026 · 11 min read · Aitherium