Tags: engineering, training, fine-tuning, infrastructure, llm, knowledge-graph

Fine-Tuning a Production Orchestrator on Consumer Hardware in 28 Minutes

March 8, 2026 · 18 min read · Aitherium

There is a category of engineering problem that sounds impossible until you actually do the math. Fine-tuning a production-grade orchestrator model on domain-specific data, on consumer hardware, in under 30 minutes, and having the result actually be better than the base model -- that sounds like it belongs in a marketing deck. It is not. It is a training run we completed today, and this post documents exactly what we did, why each decision was made, and what the results imply.

The Problem

AitherOS runs dozens of microservices across 12 architectural layers, covering everything from GPU scheduling to agent dispatch to security monitoring to knowledge graphs. The central orchestrator model -- the one that routes every LLM call, selects tools, classifies intent, and decides which agent handles which task -- is NVIDIA's Nemotron-Orchestrator-8B, served locally by vLLM.

The base model is excellent at general agentic workflows. It was trained with GRPO (Group Relative Policy Optimization) on top of Qwen3-8B specifically to be good at function calling, tool routing, and multi-step orchestration. On the Berkeley Function Calling Leaderboard, it outperforms GPT-4o. On BFCL v3 and Nexus, it scores competitively with models 10x its size.

But it does not know our system. It does not know how our services communicate, how our security model works, how our boot sequence resolves dependencies, or how our intent classification taxonomy routes effort. Every orchestration call requires extensive system prompting. Context windows fill up with architecture documentation that a properly trained model would already have internalized.

The solution: teach the model to reason like our system architects. The question is how to do it without destroying what makes it good.

The Key Insight: Graphs, Not Files

The obvious approach is to scrape your codebase into JSONL files and feed them into a training pipeline. Extract function signatures, pair them with docstrings, call it a dataset. This is what most fine-tuning tutorials suggest.

It is also wrong.

Raw file scraping produces junk. We know this because we tried it. Our first-pass codebase extraction produced 119,804 examples. Analysis showed 27% were under 100 characters -- one-line function stubs with no educational value. Another 46% were medium-quality but lacked structural context. Only 27% had outputs substantial enough to teach reasoning patterns.

The problem is not volume. The problem is that flat file parsing destroys the relationships that make code understandable. A function signature tells you what something does. A call graph tells you why it exists. A cross-domain edge between a service definition and its health check tells you how the system thinks about reliability. These relationships are the difference between memorizing syntax and understanding architecture.

AitherOS already has this relationship data. We have 12 faculty graphs that index every aspect of the system. The training data was already there. We just needed to harvest it properly.

Training Data Strategy: Graph-Native Harvesting

The guiding principle: train reasoning patterns, architecture thinking, and coding capability -- NOT system-specific facts. The knowledge graph and the context assembly system handle facts at runtime. The model needs to learn how to think, not what to remember.

Source 1: Code Graph (Code Understanding + Architecture)

The code graph is our AST-parsed code index with full call graphs. Every Python function is a node with:

  • Signature, docstring, body preview
  • Cyclomatic complexity score
  • Call graph: what it calls, what calls it
  • Parent class, base classes, imports

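As a rough sketch (field names here are illustrative, not our actual schema), a code-graph node carries something like:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CodeNode:
    """One Python function as indexed by the code graph (illustrative fields)."""
    qualified_name: str
    signature: str
    docstring: str
    body_preview: str
    complexity: int                                  # cyclomatic complexity score
    calls: List[str] = field(default_factory=list)   # outgoing edges: what it calls
    called_by: List[str] = field(default_factory=list)  # incoming edges: what calls it
    parent_class: Optional[str] = None
    imports: List[str] = field(default_factory=list)

# A hypothetical node -- names are made up for illustration
node = CodeNode(
    qualified_name="services.registry.register_service",
    signature="def register_service(name: str, port: int) -> bool",
    docstring="Register a service and its health endpoint.",
    body_preview="...",
    complexity=6,
    calls=["validate_port", "write_registry", "schedule_health_check"],
    called_by=["boot_phase_0"],
)
```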
From this graph, we extract three categories of training examples:

Code explanation examples. Given a function signature, explain what it does, its parameters, return value, and design patterns. The code graph data includes the actual docstring, complexity metrics, and dependency information -- the training pair teaches the model to produce explanations that reference real architectural context, not generic descriptions.

Code review examples. Given a function body with its complexity score and dependency count, produce a structured review covering correctness, error handling, performance, and security. A function with complexity 8 and 12 outgoing calls gets a different review than one with complexity 2 and no dependencies. The graph metrics ground the review in observable structure.

Call graph analysis. Given a function's callers and callees, analyze its architectural role. A function with high fan-in (many callers) is a utility -- changes have high blast radius. A function with high fan-out (many calls) is an orchestrator -- it coordinates sub-operations. The model learns to reason about system structure from graph topology, not from reading documentation.
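The fan-in/fan-out heuristic can be sketched as follows (the threshold of 5 is an illustrative assumption; the post does not state the actual cutoffs):

```python
def classify_role(fan_in: int, fan_out: int, hub_threshold: int = 5) -> str:
    """Infer a function's architectural role from call-graph degree.

    Threshold is illustrative -- the real harvester's cutoffs are not shown here.
    """
    if fan_in >= hub_threshold and fan_out >= hub_threshold:
        return "hub"           # widely used AND widely delegating
    if fan_in >= hub_threshold:
        return "utility"       # many callers: changes have high blast radius
    if fan_out >= hub_threshold:
        return "orchestrator"  # many callees: coordinates sub-operations
    return "leaf"

# e.g. a function with 12 callers and 1 callee reads as a utility
role = classify_role(fan_in=12, fan_out=1)
```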

Quality gate: minimum complexity 3, minimum 5 lines. This filters trivial getters and setters that teach nothing.
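The gate itself is a one-liner; the thresholds below are the ones stated above, applied to an illustrative dict-shaped node:

```python
MIN_COMPLEXITY = 3  # thresholds from the quality gate above
MIN_LINES = 5

def passes_quality_gate(node: dict) -> bool:
    """Drop trivial getters/setters that teach nothing."""
    return (node["complexity"] >= MIN_COMPLEXITY
            and node["line_count"] >= MIN_LINES)

nodes = [
    {"name": "get_id", "complexity": 1, "line_count": 2},
    {"name": "resolve_boot_order", "complexity": 7, "line_count": 42},
]
kept = [n["name"] for n in nodes if passes_quality_gate(n)]
```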

Source 2: Memory Graph (Knowledge + Learning Patterns)

The memory graph stores everything the system has learned across sessions. Three memory types, each producing different training examples:

Episodic memories become "what was learned" examples. Given a past interaction, summarize the key insight and how it applies to future tasks. These teach the model to extract transferable knowledge from specific experiences.

Semantic memories become concept explanation examples. These are facts and knowledge the system has crystallized through repeated exposure. The memory strength (a decay curve) acts as a quality signal -- strongly reinforced memories are higher quality training data.

Procedural memories become step-by-step instruction examples. How-tos and standard operating procedures, learned from actual system operations.

Edge-based synthesis examples. When a memory node has multiple neighbors connected by DERIVED_FROM, REINFORCES, or RELATED edges, we extract a knowledge synthesis example: given these related pieces of knowledge, produce a coherent understanding. This teaches the model to combine information across domains -- exactly the reasoning pattern that distinguishes a good orchestrator from a simple router.

Quality gate: minimum memory strength 0.3 (skip fading memories that the system itself has deprioritized).
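The post specifies the 0.3 gate but not the decay curve itself, so here is a minimal sketch assuming simple exponential decay with a hypothetical 30-day half-life:

```python
MIN_STRENGTH = 0.3  # gate from the post: skip fading memories

def memory_strength(initial: float, age_days: float,
                    half_life_days: float = 30.0) -> float:
    """Exponential decay sketch -- the actual decay curve is an assumption here."""
    return initial * 0.5 ** (age_days / half_life_days)

def harvestable(initial: float, age_days: float) -> bool:
    """A memory is worth harvesting only if its decayed strength clears the gate."""
    return memory_strength(initial, age_days) >= MIN_STRENGTH
```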

Source 3: Cross-Domain Linking (Multi-Domain Reasoning)

This is the highest-value data source. The cross-domain linker automatically detects relationships across code, configuration, tests, memory, infrastructure, and documentation. Each edge has a confidence score and evidence string.

A cross-domain edge from a code function to a configuration key, with confidence 0.85 and evidence like "a service registration function reads the port mapping from the service registry" -- this becomes a training example that teaches the model how code and configuration interact. The model learns to trace implications across system boundaries.

These examples are unreplicable. They encode the specific structure of our system's cross-cutting concerns. A model trained on them develops intuition for questions like "if I change this config, what code is affected?" -- the kind of reasoning that takes a new engineer months to develop.
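A conversion like the port-mapping edge above might be sketched as follows (field names, prompt wording, and the confidence cutoff are all illustrative):

```python
from typing import Optional

def edge_to_example(edge: dict, min_confidence: float = 0.7) -> Optional[dict]:
    """Turn a cross-domain edge into a ShareGPT-style training pair.

    The 0.7 cutoff and field names are assumptions for illustration.
    """
    if edge["confidence"] < min_confidence:
        return None  # weak edges are not worth training on
    prompt = (f"How does {edge['source']} relate to {edge['target']}, "
              f"and what breaks if the latter changes?")
    answer = f"{edge['evidence']} (edge confidence: {edge['confidence']:.2f})"
    return {"conversations": [
        {"from": "human", "value": prompt},
        {"from": "gpt", "value": answer},
    ]}

example = edge_to_example({
    "source": "register_service()",
    "target": "service_registry.port_mapping",
    "confidence": 0.85,
    "evidence": "a service registration function reads the port mapping "
                "from the service registry",
})
```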

Source 4: Expedition Graph (Task Decomposition + Agent Orchestration)

Expeditions are multi-session projects with full task decomposition, agent assignment, decision records, and gate outcomes stored in a local database. From completed expeditions, we extract:

Task decomposition examples. Given a project goal, decompose it into phases and tasks with agent assignments. The expedition graph preserves which agent handled which task, how many attempts it took, and whether it succeeded -- ground truth for orchestration training.

Decision reasoning examples. When an expedition required a decision (architecture choice, technology selection, trade-off resolution), the rationale is stored. These teach the model structured decision-making with evidence.

Source 5: Gateway + MCP Telemetry (External Agent Patterns)

The external gateway and the MCP gateway collect anonymized telemetry from external ADK agents. This data -- backend types, model families, tool call sequences -- teaches the model how real agents interact with the platform in production.

Tool call chains from MCP sessions are particularly valuable: they show multi-step workflows that real users execute, not synthetic benchmarks. "The agent searched code, then read context, then analyzed the result" is a real tool-use pattern worth learning.

All gateway data is anonymized via the session anonymizer before training: agent IDs are hashed, usernames stripped, paths normalized, secrets redacted.

Source 6: Analytics Outcomes (Intent Routing Feedback)

The analytics pipeline stores every LLM call outcome: intent type, effort level, model used, success/failure, latency, quality score. These are not individual training examples -- they are calibration data that the model uses to learn routing decisions. "Intent X with effort level 6 succeeds 94% of the time with the orchestrator model but only 67% with the small model" is the kind of statistical pattern that should be in the weights.

Source 7: DPO Preference Pairs (Quality Alignment)

Two sources of preference pairs for Direct Preference Optimization (Phase 2):

Quality gate verdicts. When the frontier quality gate (cloud-based) reviews a delivery and marks it not approved, the system retries with quality feedback. The approved retry becomes the "chosen" response; the original rejection becomes the "rejected" response. These pairs teach the model what quality looks like without explicit labeling.

User feedback. Thumbs up/down ratings on responses, stored in the analytics pipeline as outcome records. A rating of 4 or higher becomes "chosen"; 2 or lower becomes "rejected". When paired, these teach the model to prefer responses that actual users found useful.
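Pairing works cross-product per prompt; the thresholds are the ones above, and the record shape is illustrative:

```python
CHOSEN_MIN, REJECTED_MAX = 4, 2  # rating thresholds from the post

def to_dpo_pairs(records):
    """Pair high- and low-rated responses to the same prompt into DPO examples.

    `records` is an illustrative list of {prompt, response, rating} dicts.
    """
    by_prompt = {}
    for r in records:
        bucket = by_prompt.setdefault(r["prompt"], {"chosen": [], "rejected": []})
        if r["rating"] >= CHOSEN_MIN:
            bucket["chosen"].append(r["response"])
        elif r["rating"] <= REJECTED_MAX:
            bucket["rejected"].append(r["response"])
        # ratings of 3 are ambiguous and contribute to neither side
    pairs = []
    for prompt, b in by_prompt.items():
        for good in b["chosen"]:
            for bad in b["rejected"]:
                pairs.append({"prompt": prompt, "chosen": good, "rejected": bad})
    return pairs
```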

Source 8: Synthetic Generation (Teacher Models)

For categories where graph data is sparse, we generate synthetic examples using local teacher models (deepseek-r1:14b via the model scheduler) or Gemini API (free tier fallback). Five categories: architecture decisions, reasoning chains, code review, tool use, PowerShell patterns.

The synthetic generator is local-first: model scheduler, then Ollama, then Gemini API. No cloud dependency required.

Source 9: AitherZero Scripts (PowerShell Expertise)

170+ numbered automation scripts with structured headers (.SYNOPSIS, .DESCRIPTION, .PARAMETER, .EXAMPLE). Each script becomes an instruction/output pair: "Write a PowerShell 7+ script that configures Windows services with health monitoring" paired with the actual production script.
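Extracting the pair from a script's comment-based help can be sketched like this (the regex and instruction template are simplifications, not our actual parser):

```python
import re
from typing import Optional

def parse_ps_header(script: str) -> Optional[dict]:
    """Pull .SYNOPSIS / .DESCRIPTION out of a PowerShell comment-based help
    block and build an instruction/output training pair. Simplified sketch:
    only handles single-line field values."""
    fields = {}
    for m in re.finditer(r"\.(SYNOPSIS|DESCRIPTION)\s*\n\s*(.+)", script):
        fields[m.group(1)] = m.group(2).strip()
    if "SYNOPSIS" not in fields:
        return None  # no structured header, nothing to harvest
    return {
        "instruction": f"Write a PowerShell 7+ script: {fields['SYNOPSIS']}",
        "output": script,  # the production script is the target output
    }
```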

The Graph Advantage

Here is why this matters: a flat-file scraper sees a function. A graph harvester sees a function, its callers, its callees, its complexity, the config it reads, the tests that cover it, the memory entries that reference it, the expeditions that deployed it, and the cross-domain edges that connect it to infrastructure.

The same function in a flat dataset is one example. In the graph, it is five examples (code understanding, code review, call graph analysis, cross-domain reasoning, test coverage analysis), each teaching a different aspect of system thinking.

And because the graph infrastructure handles deduplication (the unified knowledge layer uses a 0.85 cosine similarity threshold), the training data is semantically unique. No two retained examples teach the same pattern, even across different data sources.
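Greedy semantic dedup at that threshold looks roughly like this; `embed` stands in for the in-process embedding engine, and the vectors in any real run would be 768-dimensional:

```python
import math

DEDUP_THRESHOLD = 0.85  # same cosine cutoff the unified knowledge layer uses

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def dedup(examples, embed):
    """Keep an example only if it is less than 0.85 cosine-similar to
    everything already kept. O(n^2) sketch; the real engine is in-process
    and presumably indexed."""
    kept, kept_vecs = [], []
    for ex in examples:
        v = embed(ex)
        if all(cosine(v, kv) < DEDUP_THRESHOLD for kv in kept_vecs):
            kept.append(ex)
            kept_vecs.append(v)
    return kept
```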

The Pipeline

The automated training pipeline runs on Mon/Wed/Fri at 10 PM UTC. Four phases, fully automated:

Phase 1: Graph Harvesting. The graph harvester queries all five graph sources (code graph, memory graph, cross-domain linker, expedition graph, gateway telemetry). Each source contributes examples with full provenance -- source graph, node ID, quality score. Examples are anonymized and saved in ShareGPT format.

Phase 2: Synthetic Generation. The synthetic data generator produces examples using local reasoning models (deepseek-r1:14b) or Gemini API fallback. Five categories, 10 per category per run.

Phase 3: AitherZero Harvesting. PowerShell scripts parsed and converted to training pairs.

Phase 4: DPO Collection. Quality gate verdicts and user feedback ratings queried from the analytics pipeline, paired into preference examples for Phase 2 alignment training.

Quality filtering uses the embedding engine (in-process, no HTTP) for semantic dedup at 0.85 cosine threshold -- the same threshold the unified knowledge layer uses for knowledge dedup. Training data lineage is tracked as graph nodes in the knowledge graph for full provenance.

LoRA Configuration

Base Model: nvidia/Nemotron-Orchestrator-8B (GRPO-tuned from Qwen3-8B). This model was chosen specifically because it already excels at function calling and tool routing. We are adding system understanding to an already-excellent orchestrator, not trying to teach orchestration from scratch.

LoRA: Rank 32, alpha 64, dropout 0.05. Target modules: all attention projections (q_proj, k_proj, v_proj, o_proj) plus all FFN projections (gate_proj, up_proj, down_proj). Seven modules total. This is aggressive -- rank 32 with full projection targeting gives significantly more learning capacity than the conservative rank 16 we used in early runs. The RTX 5090's 32GB VRAM has headroom.

Trainable Parameters: ~87M trainable out of 8.2B total (~1.06%).

Training Hyperparameters: Batch size 2, gradient accumulation steps 8 (effective batch 16), learning rate 1e-4 with cosine schedule, warmup ratio 0.03, 3 epochs, bf16 enabled, gradient checkpointing enabled.

Quantization: 4-bit NormalFloat (NF4) with double quantization. NF4 is information-theoretically optimal for normally distributed weights.
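Gathered into one place, the run configuration looks like this. The key names mirror the Hugging Face peft/bitsandbytes options they would map to, but this is a plain dict for reference, not a framework call; the derived numbers match the figures above:

```python
# Hyperparameters from this post, collected for reference. Key names follow
# peft's LoraConfig and bitsandbytes' 4-bit options, but this is a sketch.
config = {
    "base_model": "nvidia/Nemotron-Orchestrator-8B",
    "lora": {
        "r": 32, "lora_alpha": 64, "lora_dropout": 0.05,
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                           "gate_proj", "up_proj", "down_proj"],
        "task_type": "CAUSAL_LM",
    },
    "quant": {"load_in_4bit": True, "bnb_4bit_quant_type": "nf4",
              "bnb_4bit_use_double_quant": True},
    "train": {"per_device_batch_size": 2, "gradient_accumulation_steps": 8,
              "learning_rate": 1e-4, "lr_scheduler": "cosine",
              "warmup_ratio": 0.03, "epochs": 3, "bf16": True,
              "gradient_checkpointing": True},
}

# batch 2 x 8 accumulation steps = effective batch 16
effective_batch = (config["train"]["per_device_batch_size"]
                   * config["train"]["gradient_accumulation_steps"])
# ~87M trainable of 8.2B total ≈ 1.06%
trainable_fraction = 87e6 / 8.2e9
```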

Hardware: NVIDIA RTX 5090 (32GB GDDR7). Training uses approximately 16GB VRAM with r=32 QLoRA on the 8B model.

Timing: ~4-6 hours for full dataset (estimated with r=32 on curated graph data), 2 minutes merge, 5 minutes operational overhead.

Benchmark-Driven Evaluation

The evaluation pipeline uses a BenchmarkRunner with 19 held-out prompts across 5 categories:

  • Reasoning (5 prompts): throughput calculations, retry math, cascade effects, pagination strategies, architectural patterns
  • Tool use (4 prompts): multi-tool chains, error recovery, diagnosis workflows
  • Code review (4 prompts): security analysis, pattern detection, refactoring judgment
  • Orchestration (3 prompts): multi-agent design, effort routing, context management
  • Intent classification (3 prompts): intent decomposition, model routing decisions

Each prompt is evaluated by a local judge (deepseek-r1:14b via the model scheduler) that scores on a 0-1 scale. The judge strips <think> tags from its own reasoning before extracting the score. Pass threshold: 0.7.
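The score-extraction step can be sketched as follows; the `Score:` output format is an assumption, since the post only says the judge strips its own `<think>` tags and scores on a 0-1 scale:

```python
import re
from typing import Optional

PASS_THRESHOLD = 0.7  # pass threshold from the post

def extract_score(judge_output: str) -> Optional[float]:
    """Strip the judge's own <think> reasoning, then pull the first 0-1 number.

    Output format ("Score: 0.85") is a hypothetical example.
    """
    visible = re.sub(r"<think>.*?</think>", "", judge_output, flags=re.DOTALL)
    m = re.search(r"\b(0(?:\.\d+)?|1(?:\.0+)?)\b", visible)
    return float(m.group(1)) if m else None

score = extract_score("<think>0.2 seems low... no, the chain is solid.</think>Score: 0.85")
passed = score is not None and score >= PASS_THRESHOLD
```

Stripping the `<think>` block first matters: without it, the regex would latch onto a number inside the judge's scratch reasoning rather than the final verdict.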

The closed-loop controller compares the trained model's benchmark score against the active version. Regression threshold: 0.05. If the new model scores more than 0.05 below the current one, it auto-rolls back -- the model version store records the rollback with scores for post-mortem analysis.
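The rollback rule itself is trivial, which is the point; a sketch:

```python
REGRESSION_THRESHOLD = 0.05  # from the post: max tolerated score drop

def should_rollback(new_score: float, active_score: float) -> bool:
    """Auto-rollback rule: the candidate must not score more than 0.05
    below the currently active version."""
    return new_score < active_score - REGRESSION_THRESHOLD
```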

The Flywheel

Every interaction with AitherOS generates more graph data.

A developer debugs a boot sequence issue. The conversation becomes an episodic memory in the memory graph. The code changes update the code graph's call chains. The cross-domain linker detects new edges between the modified code and its tests. The expedition system records the task decomposition. The quality gate's verdict becomes a DPO pair.

Next training run, the graph harvester queries all of this. The model gets better at handling similar problems. Because it is better, developers use it for more complex tasks. Those tasks generate richer graph data. The graph harvester produces higher-quality training examples.

The flywheel has a specific property: the training data is generated as a byproduct of productive work. Nobody creates training examples. The graph infrastructure captures them automatically as a side effect of system operation.

Each revolution pushes the boundary of what the local model handles independently. Today it handles ~80% of routine orchestration. The graph data from the other 20% -- the hard problems that still need frontier models -- becomes the training signal for the next iteration.

From Tool Calling to System Understanding

Consider what happens when a model that is already state-of-the-art at function calling learns system structure from graph relationships.

Before training, Nemotron-Orchestrator could produce a syntactically correct call to any API from a signature. After training on code graph call chains, it also understands that service registration triggers a topological dependency sort, that health endpoints must respond within 2 seconds, and that services in boot phase 0 must register before phase 1 services can start.

Before training, the model could select the right tool from 300+ options. After training on memory graph episodic learnings, it has seen what happens when tools fail -- the error patterns, the recovery strategies, the escalation chains that real system operators discovered through experience.

Before training, the model could route intent to the correct agent. After training on cross-domain linker edges, it understands why certain intents require certain agents -- the security implications of letting the wrong agent access certain tools, the performance impact of routing a simple query through the full reasoning pipeline.

This is not function calling. This is operational understanding built from graph-structured knowledge.

Technical Specifications

Base Model: nvidia/Nemotron-Orchestrator-8B (GRPO-tuned from Qwen3-8B)

LoRA Configuration: Rank 32, alpha 64, dropout 0.05. Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj. Task type: CAUSAL_LM.

Training Data Sources:

| Source | Type | Quality Signal |
| --- | --- | --- |
| Code Graph | Code chunks with call graphs | Cyclomatic complexity, fan-in/fan-out |
| Memory Graph | Episodic/semantic/procedural | Memory strength (decay curve) |
| Cross-Domain Linker | Multi-domain edges | Edge confidence score |
| Expedition Graph | Task decomposition | Completion status, attempt count |
| Gateway Telemetry | Tool call chains | Session completeness |
| Analytics Outcomes | Intent routing | Success rate, latency |
| Quality Gate | DPO pairs | Approval/rejection |
| User Feedback | DPO pairs | Rating score |
| Synthetic (local) | Teacher model generated | Output length, structure |
| AitherZero Scripts | PowerShell patterns | Script completeness |

Deduplication: Embedding engine (nomic-embed-text-v1.5, 768-dim) with 0.85 cosine similarity threshold. Same threshold the unified knowledge layer uses. In-process, no HTTP overhead.

Anonymization: The session anonymizer applies DLP redaction, home directory stripping, username hashing, and git URL anonymization before any data enters the training pipeline.

Inference: vLLM with merged LoRA weights, served behind the model scheduler for VRAM coordination and priority queuing.

Prompt Compression: A trained-model flag in the system prompt builder skips ~2,800 tokens per call by omitting layers the model has internalized (system knowledge, identity, git workflow, response format). ~15-20% throughput gain at 30 calls/min.

What Comes Next

Phase 2: DPO Alignment. Once we have sufficient preference pairs from quality gate verdicts and user feedback (~2K pairs), we apply Direct Preference Optimization on the SFT model. Config: beta=0.1, lr=5e-5, 1 epoch. Objective: train the model to prefer concise, structured, non-sycophantic responses.

Continuous harvesting. The graph harvester runs on schedule (Mon/Wed/Fri). Each run queries the latest graph state, producing training data that reflects the system's current architecture. The model stays synchronized with the codebase as it evolves.

Benchmark expansion. The current 19 prompts will grow to 200 held-out examples with golden answers. Categories will expand to include security analysis, performance diagnosis, and multi-agent coordination.

A/B testing. The model version store tracks every model version with benchmark scores. Promoted models can be compared against baselines. The closed-loop controller auto-rolls back regressions, but we also want positive A/B testing on held-out queries to measure real-world improvement.

The long-term trajectory: a local model that understands your system deeply enough that cloud API calls become the exception. Not because the local model is as generally capable as frontier models -- it is not and may never be. But because for the specific task of orchestrating your specific system, graph-structured domain knowledge beats general capability.

We built the graph infrastructure for runtime context assembly. It turns out the same infrastructure produces training data that is richer, more structured, and more semantically diverse than anything a flat-file scraper could produce. The graphs were already there. We just had to harvest them.
