Tags: engineering, training, fine-tuning, automation, dark-factory, llm, self-improving, benchmark

The Model That Trains Itself: Building a Closed-Loop Self-Improving AI Pipeline

March 15, 2026 · 8 min read · Aitherium

Every 12 hours, AitherOS fine-tunes its own orchestrator model without human intervention. The pipeline reads the live codebase, mines real conversations, generates training data, trains, benchmarks, and either promotes or rolls back — fully autonomously. This is the engineering breakdown of how it works.

The Problem With Static Models

An AI system that orchestrates 100+ microservices goes stale fast. Every new agent, every refactored module, every added tool is something the model has never seen. Manual fine-tuning doesn't scale. The answer was to make the training data self-regenerating.

Data Generation: 13 Sources, Every 12 Hours

The pipeline runs 13 data generators against the live system state:

Generator | What It Produces | Target Categories
AST parser | Function signatures, docstrings, call graphs from 1,143 Python files | Code understanding QA
Diff miner | Git diffs → before/after reasoning pairs | Code review
Session miner | Real developer conversations (filtered) | Intent classification, tool use
Knowledge graph | Entity relationships from AitherKnowledgeGraph | Architecture reasoning
Goal examples | Active goal state → planning pairs | Orchestration
Interrupt handler | Interrupt patterns → response examples | Decision making
Architecture docs | Service layer diagrams → explanation pairs | System knowledge
Benchmark targeted | 24 examples matching exact eval prompt patterns | Benchmark categories
Wikipedia harvest | Domain knowledge from linked articles | General reasoning
Affect examples | Emotional state → tone calibration | Affect reasoning
Conciseness drills | Verbose → terse rewrites | Conciseness
Tool spec | MCP tool schemas → usage examples | Tool use
Consciousness | Introspection prompts → reflective answers | Consciousness articulation

Total before filtering: 4,263 examples. After quality filter: 1,183 (27.7%).
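In sketch form, the fan-out is a registry of callables whose output gets tagged with its source before pooling. The generator names and payloads below are illustrative stand-ins, not the pipeline's real interfaces:

```python
from typing import Callable

def run_generators(generators: dict[str, Callable[[], list[dict]]]) -> list[dict]:
    """Run every registered generator and pool the raw examples."""
    examples = []
    for name, gen in generators.items():
        batch = gen()
        for ex in batch:
            ex["source"] = name  # tag provenance for the per-source cap later
        examples.extend(batch)
    return examples

# Two toy generators standing in for the 13 real ones.
registry = {
    "ast_parser": lambda: [{"prompt": "What does foo() do?", "response": "..."}],
    "diff_miner": lambda: [{"prompt": "Review this diff", "response": "..."}],
}
pooled = run_generators(registry)
```

Tagging provenance at generation time is what makes the downstream category cap possible.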

Quality Filtering

Every example runs through a quality gate before training:

  • Length check: response 10–800 tokens
  • Repetition penalty: bigram repeat ratio < 0.4
  • XML noise filter: strips terminal dumps, task notification blocks
  • Coherence score: embedding similarity between prompt and response
  • Category balance: caps any single generator at 600 examples to prevent mode collapse
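A minimal sketch of the length, repetition, and balance checks. The coherence and XML filters are omitted, whitespace splitting stands in for real tokenization, and the `source` field is an assumed provenance tag:

```python
from collections import Counter

def bigram_repeat_ratio(text: str) -> float:
    """Fraction of bigrams that repeat an earlier bigram (0.0 = no repeats)."""
    words = text.split()
    if len(words) < 2:
        return 0.0
    bigrams = list(zip(words, words[1:]))
    return 1.0 - len(set(bigrams)) / len(bigrams)

def passes_gate(example: dict, min_tokens: int = 10, max_tokens: int = 800,
                max_repeat: float = 0.4) -> bool:
    """Length and repetition checks only; coherence/XML filters not shown."""
    n_tokens = len(example["response"].split())  # crude token proxy
    if not (min_tokens <= n_tokens <= max_tokens):
        return False
    if bigram_repeat_ratio(example["response"]) >= max_repeat:
        return False
    return True

def cap_per_source(examples: list[dict], cap: int = 600) -> list[dict]:
    """Keep at most `cap` examples per generator to prevent mode collapse."""
    kept, counts = [], Counter()
    for ex in examples:
        if counts[ex["source"]] < cap:
            kept.append(ex)
            counts[ex["source"]] += 1
    return kept
```

The cap is deliberately a hard cutoff rather than a reweighting: it keeps any single generator from dominating the mix regardless of how much it emits.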

Run #2 skipped the cap. 2,666 codebase QA examples made the model verbose and repetitive. Run #3 capped it at 600 and conciseness recovered from 0.24 to 0.74. Data balance beats data volume.

Training: QLoRA on Nemotron-Orchestrator-8B

Base model: nvidia/Nemotron-Orchestrator-8B
Method: QLoRA (4-bit quantization + LoRA adapters)
LoRA rank: r=32, alpha=64
Batch size: 4 (gradient accumulation ×4 = effective 16)
Epochs: 3
Learning rate: 2e-4 with cosine schedule
Hardware: single A100 80GB
Duration: ~45 minutes
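As a config fragment, the run maps onto Hugging Face peft/transformers roughly like this. The `target_modules` list is an assumption about the base model's attention layout, not something the post states:

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # QLoRA: 4-bit base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=32, lora_alpha=64,                    # rank/alpha from the run above
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed layout
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,          # effective batch size 16
    num_train_epochs=3,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    bf16=True,
)
```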

The 8B size is intentional. The orchestrator handles routing, intent classification, and tool dispatch — not deep reasoning. That tier (14B) is separate. An 8B model that routes correctly and stays concise beats a 14B model that overthinks every dispatch call.

Benchmark: 9 Categories, Hard Threshold

Promotion requires an overall score of at least 0.70, averaged across all 9 categories:

Category | What It Tests
intent_classification | Route a request to the right handler
orchestration | Multi-step planning and tool sequencing
tool_use | Correct MCP tool invocation with the right params
code_review | Identify bugs and suggest fixes in diffs
reasoning | Multi-step math and cascade analysis
architecture | Explain system topology and data flow
conciseness | Answer correctly in the fewest tokens
affect_reasoning | Match tone to emotional context
consciousness_articulation | Reflect on own state and limitations

The benchmark runs 15 prompts per category (135 total) and scores with a judge model. Results are logged to AitherChronicle and compared against the previous champion.
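Assuming the overall score is the unweighted mean of per-category means (the weighting isn't specified above), the aggregation is just:

```python
from statistics import mean

def score_benchmark(results: dict[str, list[float]]) -> tuple[float, dict[str, float]]:
    """Average judge scores within each category, then average the categories."""
    per_category = {cat: mean(scores) for cat, scores in results.items()}
    overall = mean(per_category.values())
    return overall, per_category

# Toy run: 2 prompts per category instead of the real 15.
overall, per_cat = score_benchmark({
    "intent_classification": [1.0, 0.5],
    "tool_use": [0.5, 0.5],
})
```

Averaging category means (rather than pooling all 135 prompts) keeps each category's weight equal even if prompt counts drift.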

Auto-Promotion and Rollback

If overall ≥ 0.70 and no single category drops below 0.30: the model is promoted to active orchestrator and Genesis restarts on the new weights. The previous checkpoint is archived.

If it fails: the run is logged, the current champion stays active, and the pipeline analyses which categories regressed to inform the next data generation cycle.
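The two branches reduce to a single gate; the thresholds are the ones quoted above:

```python
def should_promote(overall: float, per_category: dict[str, float],
                   overall_min: float = 0.70, category_floor: float = 0.30) -> bool:
    """Promote only if overall clears 0.70 AND no category falls below 0.30."""
    return overall >= overall_min and min(per_category.values()) >= category_floor
```

The per-category floor is the interesting part: it blocks a model that buys a high average by collapsing on one skill.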

Run #4 passed training but Genesis restarted (unrelated) before the benchmark completed. The pipeline detected the incomplete state and re-ran automatically on the next cycle.

What the Next Cycle Learns From

Every training run feeds back into the generators:

  • Session miner reads new developer conversations since the last run
  • Diff miner reads new git commits
  • AST parser re-parses any modified Python files
  • Benchmark targeted examples are refreshed based on last run's weak categories
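The last step, refreshing targeted examples, only needs the per-category scores from the previous run. A sketch, assuming 0.70 as the per-category weakness threshold:

```python
def weak_categories(per_category: dict[str, float],
                    threshold: float = 0.70) -> list[str]:
    """Categories below threshold, worst first — these get extra targeted data."""
    return sorted((c for c, s in per_category.items() if s < threshold),
                  key=per_category.get)

# Run #2's conciseness collapse would surface first here.
priorities = weak_categories({"conciseness": 0.24, "reasoning": 0.9, "tool_use": 0.6})
```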

The model that runs the system tomorrow is trained on the system as it exists today.

Results: 0.58 → 0.708

Five runs to cross the 0.70 threshold. The full breakdown is in 0.708: The World Model Passed Its Own Benchmark.

The pipeline now runs every 12 hours. If the next run scores higher, it promotes automatically. The system is continuously improving itself.
