The Model That Trains Itself: Building a Closed-Loop Self-Improving AI Pipeline
Every 12 hours, AitherOS fine-tunes its own orchestrator model without human intervention. The pipeline reads the live codebase, mines real conversations, generates training data, trains, benchmarks, and either promotes or rolls back — fully autonomously. This is the engineering breakdown of how it works.
The Problem With Static Models
An AI system that orchestrates 100+ microservices goes stale fast. Every new agent, every refactored module, every added tool is something the model has never seen. Manual fine-tuning doesn't scale. The answer was to make the training data self-regenerating.
Data Generation: 13 Sources, Every 12 Hours
The pipeline runs 13 data generators against the live system state:
| Generator | What It Produces | Examples |
|---|---|---|
| AST parser | Function signatures, docstrings, call graphs from 1,143 Python files | Code understanding QA |
| Diff miner | Git diffs → before/after reasoning pairs | Code review |
| Session miner | Real developer conversations (filtered) | Intent classification, tool use |
| Knowledge graph | Entity relationships from AitherKnowledgeGraph | Architecture reasoning |
| Goal examples | Active goal state → planning pairs | Orchestration |
| Interrupt handler | Interrupt patterns → response examples | Decision making |
| Architecture docs | Service layer diagrams → explanation pairs | System knowledge |
| Benchmark targeted | 24 examples matching exact eval prompt patterns | Benchmark categories |
| Wikipedia harvest | Domain knowledge from linked articles | General reasoning |
| Affect examples | Emotional state → tone calibration | Affect reasoning |
| Conciseness drills | Verbose → terse rewrites | Conciseness |
| Tool spec | MCP tool schemas → usage examples | Tool use |
| Consciousness | Introspection prompts → reflective answers | Consciousness articulation |
Total before filtering: 4,263 examples. After quality filter: 1,183 (27.7%).
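The fan-out over generators can be sketched as a simple registry that tags each example with its source (needed later for category caps). All names here (`Example`, `register`, the stubbed `ast_examples`) are hypothetical illustrations, not the real AitherOS interfaces.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Example:
    prompt: str
    response: str
    source: str  # generator name; the quality filter caps per-source counts later

# Hypothetical registry; the real generator names and signatures are assumptions.
GENERATORS: dict[str, Callable[[], list[Example]]] = {}

def register(name: str):
    def wrap(fn):
        # Wrap a (prompt, response) generator so every example carries its source tag.
        GENERATORS[name] = lambda: [Example(p, r, name) for p, r in fn()]
        return fn
    return wrap

@register("ast_parser")
def ast_examples():
    # The real pipeline walks 1,143 Python files; stubbed here for illustration.
    return [("What does parse_goal() return?", "It returns the parsed goal AST node.")]

def generate_all() -> list[Example]:
    """Run every registered generator against current system state."""
    out: list[Example] = []
    for gen in GENERATORS.values():
        out.extend(gen())
    return out
```

In the real pipeline each of the 13 generators would register itself the same way, so adding a new data source is one decorator away.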
Quality Filtering
Every example runs through a quality gate before training:
- Length check: response 10–800 tokens
- Repetition penalty: bigram repeat ratio < 0.4
- XML noise filter: strips terminal dumps, task notification blocks
- Coherence score: embedding similarity between prompt and response
- Category balance: caps any single generator at 600 examples to prevent mode collapse
Run #2 skipped the cap. 2,666 codebase QA examples made the model verbose and repetitive. Run #3 capped it at 600 and conciseness recovered from 0.24 to 0.74. Data balance beats data volume.
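A minimal sketch of two of the gate's checks plus the category cap, using the thresholds from the list above (10–800 tokens, bigram repeat ratio < 0.4, 600 per generator). The exact token counting and the `source` field layout are assumptions; the real gate also runs the XML noise filter and embedding coherence score, omitted here.

```python
from collections import Counter

def bigram_repeat_ratio(text: str) -> float:
    """Fraction of bigrams that repeat an earlier bigram (1 - unique/total)."""
    words = text.split()
    bigrams = list(zip(words, words[1:]))
    if not bigrams:
        return 0.0
    return 1.0 - len(set(bigrams)) / len(bigrams)

def passes_gate(response: str) -> bool:
    # Token count approximated by whitespace words; the real gate tokenizes properly.
    n_tokens = len(response.split())
    if not (10 <= n_tokens <= 800):
        return False
    if bigram_repeat_ratio(response) >= 0.4:
        return False
    return True

def cap_categories(examples: list[dict], cap: int = 600) -> list[dict]:
    """Keep at most `cap` examples per generator to prevent mode collapse."""
    counts: Counter = Counter()
    kept = []
    for ex in examples:
        if counts[ex["source"]] < cap:
            counts[ex["source"]] += 1
            kept.append(ex)
    return kept
```

With the Run #3 settings, 2,666 codebase QA examples would be cut to 600 before training ever sees them.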
Training: QLoRA on Nemotron-Orchestrator-8B
Base model: nvidia/Nemotron-Orchestrator-8B
Method: QLoRA (4-bit quantization + LoRA adapters)
LoRA rank: r=32, alpha=64
Batch size: 4 (gradient accumulation ×4 = effective 16)
Epochs: 3
Learning rate: 2e-4 with cosine schedule
Hardware: single A100 80GB
Duration: ~45 minutes
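The hyperparameters above map onto a standard Hugging Face `peft` + `transformers` setup roughly as follows. This is a config sketch, not the pipeline's actual training script; the `target_modules` list and dropout are assumptions (the article states only rank, alpha, batch size, epochs, and learning rate).

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

# 4-bit quantization of the frozen base model (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters trained on top; target_modules is an assumed typical choice.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,          # assumption, not stated in the article
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

training_args = TrainingArguments(
    output_dir="checkpoints/orchestrator",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size 16
    num_train_epochs=3,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    bf16=True,
)
```

At these settings an 8B base fits comfortably on a single A100 80GB, which is what keeps the cycle at ~45 minutes.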
The 8B size is intentional. The orchestrator handles routing, intent classification, and tool dispatch — not deep reasoning. That tier (14B) is separate. An 8B model that routes correctly and stays concise beats a 14B model that overthinks every dispatch call.
Benchmark: 9 Categories, Hard Threshold
Promotion requires an overall score of at least 0.70, averaged across all 9 categories:
| Category | What It Tests |
|---|---|
| intent_classification | Route a request to the right handler |
| orchestration | Multi-step planning and tool sequencing |
| tool_use | Correct MCP tool invocation with right params |
| code_review | Identify bugs and suggest fixes in diffs |
| reasoning | Multi-step math and cascade analysis |
| architecture | Explain system topology and data flow |
| conciseness | Answer correctly in the fewest tokens |
| affect_reasoning | Match tone to emotional context |
| consciousness_articulation | Reflect on own state and limitations |
The benchmark runs 15 prompts per category (135 total) and scores with a judge model. Results are logged to AitherChronicle and compared against the previous champion.
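The benchmark loop reduces to: run every prompt through the candidate, score each response with the judge, and average per category. `ask` and `judge` here are hypothetical callables standing in for the candidate model and the judge model; the real scoring rubric is not shown in the article.

```python
def run_benchmark(
    categories: dict[str, list[str]],
    ask,    # ask(prompt) -> response; stands in for the candidate model
    judge,  # judge(prompt, response) -> score in [0, 1]; stands in for the judge model
) -> tuple[float, dict[str, float]]:
    """Score the candidate on every category; return overall and per-category means."""
    per_category = {}
    for name, prompts in categories.items():
        scores = [judge(p, ask(p)) for p in prompts]
        per_category[name] = sum(scores) / len(scores)
    overall = sum(per_category.values()) / len(per_category)
    return overall, per_category
```

With 9 categories of 15 prompts each, one benchmark pass is 135 candidate calls plus 135 judge calls; both results would then be logged to AitherChronicle for the champion comparison.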
Auto-Promotion and Rollback
If overall ≥ 0.70 and no single category drops below 0.30: the model is promoted to active orchestrator and Genesis restarts on the new weights. The previous checkpoint is archived.
If it fails: the run is logged, the current champion stays active, and the pipeline analyses which categories regressed to inform the next data generation cycle.
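The gate can be written as one pure function: overall threshold, per-category floor, and a champion comparison. The 0.70 and 0.30 values come from the article; modeling the champion comparison as a strict `>` is an assumption based on the promotion behavior described here.

```python
def promotion_decision(
    overall: float,
    per_category: dict[str, float],
    champion_overall: float,
    threshold: float = 0.70,
    floor: float = 0.30,
) -> str:
    """Decide whether a freshly trained checkpoint replaces the champion."""
    if overall < threshold:
        return "keep_champion"          # failed the hard threshold
    if min(per_category.values()) < floor:
        return "keep_champion"          # one category collapsed
    if overall <= champion_overall:
        return "keep_champion"          # no improvement over the incumbent
    return "promote"                    # restart Genesis on the new weights
```

Keeping this decision pure (no I/O) makes the promote/rollback path trivially testable, which matters when the consequence of a bad decision is restarting the live orchestrator.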
Run #4 passed training but Genesis restarted (unrelated) before the benchmark completed. The pipeline detected the incomplete state and re-ran automatically on the next cycle.
What the Next Cycle Learns From
Every training run feeds back into the generators:
- Session miner reads new developer conversations since the last run
- Diff miner reads new git commits
- AST parser re-parses any modified Python files
- Benchmark targeted examples are refreshed based on last run's weak categories
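The last feedback item — refreshing targeted examples by weak category — can be sketched as a quota function. The doubling heuristic and the function names are hypothetical; the article states only that weak categories drive the next generation cycle, and that the targeted generator's baseline is 24 examples.

```python
def weak_categories(per_category: dict[str, float], cutoff: float = 0.70) -> list[str]:
    """Categories below cutoff, weakest first; these get extra attention next cycle."""
    return sorted(
        (c for c, s in per_category.items() if s < cutoff),
        key=per_category.get,
    )

def targeted_quota(per_category: dict[str, float], base: int = 24) -> dict[str, int]:
    # Hypothetical allocation: double the benchmark-targeted examples for weak categories.
    weak = set(weak_categories(per_category))
    return {c: base * (2 if c in weak else 1) for c in per_category}
```

This closes the loop: the benchmark's per-category scores from one run directly shape the training data mix of the next.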
The model that runs the system tomorrow is trained on the system as it exists today.
Results: 0.58 → 0.708
Five runs to cross the 0.70 threshold. The full breakdown is in 0.708: The World Model Passed Its Own Benchmark.
The pipeline now runs every 12 hours. If the next run clears the gate and outscores the current champion, it promotes automatically. The system is continuously improving itself.