engineering · training · fine-tuning · architecture · self-improving · deep-dive

The Model That Never Stops Learning: Continuous Microtraining in AitherOS

March 16, 2026 · 6 min read · Aitherium

We had a training pipeline. It worked. Every 12 hours, the dark factory would harvest data, train a LoRA adapter, merge it into the base model, restart vLLM, and benchmark. If the score improved, the new model was auto-promoted. If it regressed, it rolled back.

That pipeline got our orchestrator from 0.58 to 0.708. It works.

But there's a problem with discrete training cycles: the model learns in jumps. You correct it on Monday, and the correction enters the training data. On Friday at 2 AM, the weekly finetune picks it up. The following Monday, the model finally knows. That's a week of latency between "you told it something" and "it learned."

We wanted something different. We wanted the model to just... slowly become better, all the time, without anyone noticing.

What Microtraining Is

Microtraining is a background daemon that runs a 30-minute tick cycle:

  1. Harvest new examples from 4 sources (session learnings, conversations, daydreams, knowledge graph)
  2. Accumulate into a micro-batch buffer until we have 10-50 examples
  3. Train a tiny LoRA adapter (rank 8, 50-100 steps, learning rate 1e-6)
  4. Hot-load the adapter into the running vLLM instance (no restart)
  5. Evaluate — if loss regressed, auto-rollback the adapter
  6. Sleep 30 minutes. Repeat.
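The tick cycle above can be sketched as a simple loop. The five callables below stand in for the real AitherOS services (`harvest`, `train`, `hot_load`, `evaluate`, `rollback` are placeholder names, not actual APIs):

```python
import time

TICK_SECONDS = 30 * 60     # 30-minute tick
MIN_EXAMPLES = 10          # don't train on scraps
MAX_BATCH = 50             # cap per micro-batch

def tick(buffer, harvest, train, hot_load, evaluate, rollback):
    """Run one microtraining tick; returns the leftover buffer."""
    buffer = buffer + harvest()              # 1. harvest new examples
    if len(buffer) < MIN_EXAMPLES:           # 2. accumulate until enough
        return buffer
    batch, rest = buffer[:MAX_BATCH], buffer[MAX_BATCH:]
    adapter, loss = train(batch)             # 3. tiny LoRA adapter
    hot_load(adapter)                        # 4. no vLLM restart
    if not evaluate(loss):                   # 5. regression? roll back
        rollback(adapter)
    return rest

def run_daemon(**services):
    buffer = []
    while True:
        buffer = tick(buffer, **services)
        time.sleep(TICK_SECONDS)             # 6. sleep 30 minutes, repeat
```

Keeping the single tick separate from the infinite loop makes each cycle testable in isolation.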

The key numbers: rank 8 LoRA (not the rank 32 we use for full finetunes), a learning rate of 1e-6 (two hundred times smaller than the 2e-4 in our full finetune pipeline), and 50-100 gradient steps per batch. At these settings, catastrophic forgetting is essentially impossible. The model changes so slowly that any individual micro-batch is undetectable — but over days and weeks, the cumulative drift is significant.

Where the Data Comes From

Four sources, in priority order:

1. Session Learnings (highest signal)

When the same failure pattern occurs 3+ times in a conversation, our SessionLearner promotes it to persistent memory and exports it to a training JSONL file. These are the corrections that matter most — things the system got wrong repeatedly and a human had to fix. Each one becomes a training example: "When handling X, avoid Y because Z."
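Concretely, each promoted learning could serialize as one JSONL line. The field names below are illustrative, not the actual export schema:

```python
import json

# Hypothetical shape of one line in session_learnings.jsonl.
learning = {
    "instruction": "When handling X, avoid Y because Z.",
    "source": "session_learner",
    "quality": 0.8,          # pre-vetted, so scored above the 0.6 floor
    "occurrences": 3,        # promoted after the 3rd repeated failure
}
line = json.dumps(learning)  # one example = one JSONL line
```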

2. Conversations

Every successful conversation pair (user question + assistant response) from the ConversationStore is a potential training example. We filter for sessions with 4+ exchanges (too short = not enough signal), skip very short messages, and cap response length at 2,000 tokens. Quality score: 0.7 — lower than session learnings because we don't know if the user was actually satisfied.
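Those filters might look like the sketch below. The session shape and the five-word minimum for "very short" are assumptions, and token counts are approximated by whitespace splitting:

```python
CONV_QUALITY = 0.7    # below session learnings, above the 0.6 floor

def harvest_session(exchanges):
    """Turn one stored session into scored training pairs.

    `exchanges` is a list of (user, assistant) string pairs; the real
    ConversationStore records carry more metadata.
    """
    if len(exchanges) < 4:                    # too short = not enough signal
        return []
    pairs = []
    for user, assistant in exchanges:
        if len(user.split()) < 5:             # skip very short messages
            continue
        tokens = assistant.split()
        capped = " ".join(tokens[:2000])      # cap response length
        pairs.append({"prompt": user, "response": capped,
                      "quality": CONV_QUALITY})
    return pairs
```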

3. Daydreams

AitherOS agents daydream during idle periods — producing musings, reflections, and personality-colored internal monologue. The DaydreamCorpus collects these into JSONL. They train the model's voice and personality, not its factual knowledge. Quality score: 0.6 — the floor.

4. Knowledge Graph

The GraphTrainingHarvester pulls from 11 sources: CodeGraph (code patterns with call graphs), MemoryGraph (episodic and semantic knowledge), CrossDomainLinker (multi-domain reasoning chains), expedition traces, affect states, inner life reflections, and more. This is the broadest source but the noisiest, so it's weighted lowest.

All four sources are deduplicated, quality-filtered (floor: 0.6), and sorted by quality score. The top 50 examples form a micro-batch.
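Put together, batch assembly reduces to a few lines. Example records are simplified here to a text plus a score; dedup-on-content is an assumption about how the real pipeline keys duplicates:

```python
QUALITY_FLOOR = 0.6
MICRO_BATCH_SIZE = 50

def build_micro_batch(examples):
    """Dedup, quality-filter, and rank harvested examples."""
    seen, unique = set(), []
    for ex in examples:
        if ex["text"] in seen:            # dedup on content
            continue
        seen.add(ex["text"])
        if ex["quality"] >= QUALITY_FLOOR:
            unique.append(ex)
    unique.sort(key=lambda ex: ex["quality"], reverse=True)
    return unique[:MICRO_BATCH_SIZE]      # top 50 form the micro-batch
```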

Why Rank 8 and Not Rank 32

Our full finetune pipeline uses rank 32, alpha 64, learning rate 2e-4. That's appropriate for 1,000+ examples trained over 3 epochs. It's designed to make a measurable change to the model's behavior.

Microtraining is designed to make an unmeasurable change to the model's behavior — per batch. The cumulative effect over hundreds of batches is what matters. Rank 8 means we're updating fewer parameters per layer. Learning rate 1e-6 means each update is tiny. The math:

  • Full finetune: ~67M trainable parameters, 500 steps, LR 2e-4
  • Microtrain: ~17M trainable parameters, 50 steps, LR 1e-6

The effective weight change per micro-batch is roughly 1/4000th of a full finetune. That's the point. You can't break a model with 1/4000th of a training run. But you can steer it.
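One crude way to arrive at a number in that range is to multiply the three ratios (trainable parameters, steps, learning rate). Treating "effective weight change" as their product is a simplification, and at 50 steps the same arithmetic lands closer to 1/8000:

```python
full  = {"params": 67e6, "steps": 500, "lr": 2e-4}   # weekly finetune
micro = {"params": 17e6, "steps": 100, "lr": 1e-6}   # upper end of 50-100 steps

ratio = 1.0
for key in full:
    ratio *= micro[key] / full[key]

# ratio comes out around 2.5e-4, i.e. roughly 1/4000 of a full finetune
print(f"effective change: 1/{1 / ratio:,.0f} of a full finetune")
```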

The Adapter Hot-Load Problem

Traditional fine-tuning goes: train LoRA -> merge into base weights -> restart serving infrastructure. For a weekly cycle, that's fine. For a 30-minute cycle, restarting vLLM every half hour is unacceptable.

The solution: don't merge. Keep the adapter separate and hot-load it.

VLLMSwap (our model hot-swap service on port 8176) now exposes a /lora/reload endpoint. When microtraining completes a batch, it saves the adapter checkpoint and POSTs to this endpoint. VLLMSwap registers the adapter path — on the next model swap or restart, the adapter is loaded alongside the base model. When vLLM adds native runtime LoRA addition (it's on their roadmap), we'll proxy to that and get true zero-downtime adapter swaps.
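The hand-off can be sketched with stdlib HTTP. The port and `/lora/reload` path come from the description above, but the JSON payload fields and helper names are assumptions:

```python
import json
import urllib.request

VLLMSWAP_URL = "http://localhost:8176/lora/reload"   # VLLMSwap service

def build_reload_request(adapter_path, url=VLLMSWAP_URL):
    """Build the POST that registers a freshly saved adapter checkpoint.

    The payload shape here is hypothetical; only the endpoint path and
    port are documented above.
    """
    body = json.dumps({"adapter_path": adapter_path}).encode()
    return urllib.request.Request(
        url, data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def register_adapter(adapter_path):
    with urllib.request.urlopen(build_reload_request(adapter_path)) as resp:
        return resp.status == 200
```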

We keep the last 10 adapter versions. Older ones are pruned automatically.
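Pruning can be as simple as sorting version directories and deleting all but the newest ten. The timestamp-sortable directory naming is an assumption about the on-disk layout:

```python
import shutil
from pathlib import Path

MAX_ADAPTER_VERSIONS = 10

def prune_adapters(adapter_dir):
    """Delete all but the newest adapter checkpoint directories.

    Assumes one subdirectory per version, named so that lexical order
    matches age (e.g. timestamped); the real layout may differ.
    """
    versions = sorted(p for p in Path(adapter_dir).iterdir() if p.is_dir())
    for stale in versions[:-MAX_ADAPTER_VERSIONS]:
        shutil.rmtree(stale)                 # old versions are pruned
    return versions[-MAX_ADAPTER_VERSIONS:]  # survivors, oldest first
```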

Regression Protection

Every micro-batch could theoretically make the model worse. We protect against this with:

  1. Quality floor: Only examples scoring >= 0.6 enter training. Session learnings (pre-vetted by the SessionLearner promotion threshold) score 0.8. Random conversation pairs score 0.7. Junk gets filtered.

  2. EMA loss tracking: We maintain an exponential moving average of training loss across batches. If a new batch's loss exceeds the running average by more than 0.1, the adapter is automatically rolled back and the batch is discarded.

  3. Source diversity: By mixing session learnings, conversations, daydreams, and graph data, no single source can dominate. A bad batch of conversations won't overwrite good session learnings.

  4. Cursor persistence: The daemon tracks what it's already harvested. If it crashes and restarts, it picks up where it left off — no duplicate training on the same data.
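The EMA check (rule 2 above) fits in a few lines. The smoothing factor is an assumption, since only the 0.1 regression margin is specified:

```python
class EmaRegressionGuard:
    """EMA loss tracker with auto-rollback, per the rules above."""

    def __init__(self, margin=0.1, alpha=0.2):
        self.margin = margin   # documented regression threshold
        self.alpha = alpha     # smoothing factor (assumed)
        self.ema = None

    def check(self, batch_loss):
        """Return True to keep the adapter, False to roll it back."""
        if self.ema is None:                      # first batch seeds the EMA
            self.ema = batch_loss
            return True
        if batch_loss > self.ema + self.margin:   # regression: discard batch
            return False
        self.ema = (1 - self.alpha) * self.ema + self.alpha * batch_loss
        return True
```

Note that a rejected batch leaves the EMA untouched, so one bad batch can't drag the baseline up and mask the next regression.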

The Full Loop

Here's the complete data lifecycle, from user interaction to model improvement:

User talks to AitherOS
    |
    v
ChatEngine processes the conversation
    |
    +---> FluxEmitter.emit(CONV_EXCHANGE)
    |         |
    |         v
    |     KnowledgeIngester --> AitherKnowledgeGraph (episodic nodes)
    |
    +---> SessionLearner tracks failure patterns
    |         |
    |         v (3+ failures)
    |     Generate learning --> inject into context
    |         |
    |         v (5+ injections)
    |     Promote to MemoryGraph + export to session_learnings.jsonl
    |
    +---> ConversationStore saves the exchange
    |
    v
MicroTrainingDaemon (every 30 minutes)
    |
    +---> Harvest from session_learnings.jsonl
    +---> Harvest from ConversationStore
    +---> Harvest from DaydreamCorpus
    +---> Harvest from GraphHarvester
    |
    v
Quality filter (>= 0.6) + dedup + sort by quality
    |
    v
UnslothTrainer: rank 8 LoRA, 50 steps, LR 1e-6
    |
    v
EMA loss check --> rollback if regression
    |
    v
VLLMSwap /lora/reload --> adapter registered
    |
    v
FluxEmitter: MICROTRAIN_COMPLETE event
    |
    v
Next user conversation benefits from adapted model

The latency from "user correction" to "model adapted" drops from 7 days (weekly finetune) to roughly 1-3 hours (time for SessionLearner promotion + next microtrain tick). For conversation data, it's about 30-60 minutes.

Configuration

Everything is tunable in agent_kernel.yaml:

microtraining:
  enabled: true
  tick_interval_seconds: 1800      # 30 minutes
  micro_batch_size: 50             # examples per batch
  min_examples_to_train: 10       # don't train on scraps
  learning_rate: 1.0e-6           # glacial
  max_steps_per_batch: 100        # quick
  lora_rank: 8                    # small
  quality_floor: 0.6              # filtered
  rollback_on_regression: true    # safe
  max_adapter_versions: 10        # pruned

Want faster adaptation? Lower the tick interval and min_examples threshold. Want safer adaptation? Raise the quality floor and lower the regression threshold. Want to disable it entirely? One boolean.

What This Means

The coarse-grained training pipeline isn't going away. The weekly finetune with rank 32 and 1,000+ examples still makes the big jumps — new capabilities, new agent behaviors, architecture understanding. That's the heavy lift.

Microtraining handles the fine grain. The corrections. The personality drift. The conversational patterns that make you feel like the model knows you. It's the difference between a model that was trained on data like yours, and a model that was trained on your data.

The model never stops learning. It just does it quietly, in the background, a few parameters at a time.
