Building Specialist Training Corpora: How We Are Making Our AI Actually Good at Coding and Reasoning
There's a dirty secret in the AI fine-tuning world: most people train one model to do everything, and it ends up mediocre at all of it.
We did that too. Our aither-orchestrator-8bn is a solid generalist — 753K lines of training data, an automated harvest/judge/export/train/benchmark pipeline, 103GB of training artifacts. But when you ask it to write a complex async Python function with proper error handling, or to reason through a multi-step math problem, it gives you a B-minus answer when you need an A.
The fix isn't more data. It's the right data, shaped the right way, for the right purpose.
The Architecture: Separate LoRA Adapters on Shared Base
The current research consensus is clear: separate LoRA adapters per specialty beat multi-task single adapters. Think of it like hiring specialists versus generalists. Your cardiologist and your dermatologist both went to the same medical school (shared base model), but their residency training (LoRA adapter) made them experts in different domains.
We're building two specialist adapters:
- aither-coder-14b: Code generation, editing, debugging, review, and test generation
- aither-reasoner-14b: Math, logic, planning, and multi-step reasoning chains
Both sit on DeepSeek-R1-Distill-Qwen-14B as the primary base — a dense Qwen2.5 architecture with 131K context that outperforms o1-mini on reasoning benchmarks. We're also building a secondary path on Gemma 4 31B Dense (Apache 2.0, native <|think|> token, fits in 22GB with QLoRA).
The key insight: vLLM's multi-LoRA serving lets us hot-swap adapters on the same base model. One GPU, three specialists.
The Data Pipeline: Six Sources Per Specialist
Coding Corpus (Target: 30K examples)
We don't just scrape GitHub. We built a multi-source harvester that combines:
- CodeGraph patterns — Our code intelligence index (62K chunks, 2.3K files) already knows the codebase. We extract the patterns it learned.
- Git diff mining — Every commit is a training example. We mine 500 commits of history, extract before/after pairs, and classify them: bugfix, refactor, feature, test. This is real editing data from real developers.
- Session crawls — Every coding conversation our AI has is a potential training example. We filter for quality and extract the coding turns.
- External HuggingFace datasets — Magicoder-OSS-Instruct (10K), CommitPackFT (5K), Code-Feedback (3K). But we decontaminate against HumanEval/MBPP test sets because we're not cheating on our benchmarks.
- Existing corpus — Our 67MB codebase_patterns.jsonl has gold in it; we just need to extract the coding subset.
- FIM formatting — 30% of examples get converted to Fill-in-the-Middle format. Different tokens for different models: DeepSeek uses <|fim_begin|>/<|fim_hole|>/<|fim_end|>, Gemma uses <|fim_prefix|>/<|fim_suffix|>/<|fim_middle|>. This teaches the model to complete code in context, not just generate from scratch.
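The FIM conversion step can be sketched in a few lines. This is a minimal illustration, assuming both token families use a prefix/suffix/middle ordering (the token strings come from the post; verify the exact strings and layout against each model's tokenizer documentation):

```python
import random

# FIM sentinel tokens per base model family, taken from the post.
# Both families are treated as prefix/suffix/middle layouts here;
# check each model's tokenizer docs for the exact ordering.
FIM_TOKENS = {
    "deepseek": ("<|fim_begin|>", "<|fim_hole|>", "<|fim_end|>"),
    "gemma": ("<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"),
}

def to_fim(code: str, model: str, rng: random.Random) -> str:
    """Convert a code snippet into fill-in-the-middle training format."""
    tok_a, tok_b, tok_c = FIM_TOKENS[model]
    lines = code.splitlines(keepends=True)
    if len(lines) < 3:
        return code  # too short to carve out a middle span
    # Choose a contiguous span of lines to mask out as the "middle".
    start = rng.randrange(1, len(lines) - 1)
    stop = rng.randrange(start + 1, len(lines))
    prefix = "".join(lines[:start])
    middle = "".join(lines[start:stop])
    suffix = "".join(lines[stop:])
    # The model sees both sides of the hole, then learns to fill it.
    return f"{tok_a}{prefix}{tok_b}{suffix}{tok_c}{middle}"
```

In a harvester, roughly 30% of examples would pass through `to_fim` and the rest stay in plain instruction format.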
The target mix: 40% instruction, 25% editing, 20% debugging, 10% review, 5% test generation. This isn't arbitrary — it matches the actual distribution of coding tasks our AI handles in production.
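Turning that mix into per-category example budgets is a small allocation step. A sketch, with illustrative category keys (the real harvester's labels may differ):

```python
# Target mix from the post; keys are illustrative names.
CODING_MIX = {
    "instruction": 0.40,
    "editing": 0.25,
    "debugging": 0.20,
    "review": 0.10,
    "testgen": 0.05,
}

def allocate(total: int, mix: dict) -> dict:
    """Split a total example budget across categories per the target mix."""
    counts = {k: int(total * frac) for k, frac in mix.items()}
    # Hand any leftover from rounding down to the largest category.
    counts[max(mix, key=mix.get)] += total - sum(counts.values())
    return counts
```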
Reasoning Corpus (Target: 20K examples)
Reasoning data is harder to get right. The model needs to show its work, not just give answers. We use <think> tags (DeepSeek) or <|think|> tags (Gemma 4) to wrap chain-of-thought traces.
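Wrapping a trace in those tags is mechanical. A minimal sketch with the DeepSeek tags from above as defaults; Gemma's opening token would be passed explicitly, and its closing token is not specified in the post, so treat any choice there as an assumption:

```python
def wrap_cot(trace: str, answer: str,
             open_tag: str = "<think>", close_tag: str = "</think>") -> str:
    """Wrap a chain-of-thought trace in think tags ahead of the final answer."""
    return f"{open_tag}\n{trace.strip()}\n{close_tag}\n{answer.strip()}"
```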
Sources:
- Chronicle reasoning traces — Every time our AI reasons through a problem, we log it. 44MB of reasoning traces and counting.
- MCTS explorations — Our Monte Carlo Tree Search planning system generates exploration trees with branch/evaluate/prune cycles. Perfect planning training data.
- Reasoning augmentation — Three strategies from our PRISM system: DIJKSTRA (verification chains), EULER (long-horizon multi-step), NEUMANN (MCTS planning). Each produces differently structured reasoning.
- MiroThinker teacher — Cloud-based reasoning model generates deep reasoning traces on curated prompts. Quality weight: 1.3x because teacher data is pre-screened.
- STaR iterations — Self-Taught Reasoner: the model generates its own rationales, we filter for correct answers, retrain. Runs Saturday night before Sunday training.
- External datasets — Bespoke-Stratos-17K (all of it — it beats o1-preview), OpenR1-Math-220k (10K), OpenCodeReasoning (5K).
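The STaR filtering step above, keeping self-generated rationales only when the final answer checks out, can be sketched as follows. Field names are hypothetical; the real pipeline's record schema may differ:

```python
def star_filter(candidates: list, gold_answers: dict) -> list:
    """Keep self-generated rationales only when the final answer is correct.

    candidates: dicts with hypothetical fields "problem_id", "rationale", "answer".
    gold_answers: problem_id -> known-correct answer string.
    """
    kept = []
    for cand in candidates:
        gold = gold_answers.get(cand["problem_id"])
        if gold is not None and cand["answer"].strip() == gold.strip():
            kept.append(cand)  # correct answer: the rationale is worth retraining on
    return kept
```

The retained examples then feed the next training round, which is what makes STaR iterative.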
Critical detail: 10% of the reasoning corpus is intentionally non-reasoning examples. Without this, the model generates chain-of-thought for "What time is it?" — format control matters.
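That 10% blend can be sketched as a small corpus-mixing helper. The function name and ratio handling are illustrative, not the actual pipeline code:

```python
import random

def mix_format_control(reasoning: list, plain: list,
                       plain_frac: float = 0.1, seed: int = 0) -> list:
    """Blend plain (no chain-of-thought) examples into a reasoning corpus
    so the model learns when NOT to emit a think block."""
    # How many plain examples make up plain_frac of the final corpus.
    n_plain = round(len(reasoning) * plain_frac / (1 - plain_frac))
    rng = random.Random(seed)
    corpus = reasoning + rng.sample(plain, min(n_plain, len(plain)))
    rng.shuffle(corpus)
    return corpus
```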
The Training Config
Both specialists use identical LoRA hyperparameters, tuned for the base model:
- r=32, alpha=64 — Higher rank than our orchestrator (r=16) because we need more expressive capacity for specialist tasks
- 1 epoch for coding, 2 for reasoning — Coding data is larger and more diverse; reasoning data is smaller but needs deeper learning
- seq_len=8192 for coding, 4096 for reasoning — Code needs long context; reasoning traces are more compact
- LR=2e-4 with cosine schedule — Standard for LoRA on this scale
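Collected into one structure, the hyperparameters above look like this. This is a summary sketch, not the actual training config file:

```python
# Specialist hyperparameters as listed in the post (summary, not the real config).
LORA_SPECIALISTS = {
    "aither-coder-14b": {
        "r": 32, "alpha": 64, "epochs": 1,
        "seq_len": 8192, "lr": 2e-4, "lr_schedule": "cosine",
    },
    "aither-reasoner-14b": {
        "r": 32, "alpha": 64, "epochs": 2,
        "seq_len": 4096, "lr": 2e-4, "lr_schedule": "cosine",
    },
}
```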
The automated pipeline: harvest every 8h (coder) or 12h (reasoner), judge quality, export nightly, train Sunday morning, benchmark, and auto-deploy if the pass threshold (70%) is met.
Decontamination: Don't Cheat on Your Own Benchmarks
This is where most fine-tuning projects go wrong. They train on data that overlaps with their evaluation benchmarks, then celebrate "great results."
We built n-gram overlap detection against HumanEval, GSM8K, and MBPP test sets. Any training example with >50% n-gram overlap gets removed. When you benchmark after training, you want to know if your model actually learned to code and reason — not if it memorized the test set.
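The core check can be sketched in a few lines. This uses word-level n-grams with n=8; the post does not state the n or tokenization used, so both are assumptions here:

```python
def ngrams(text: str, n: int = 8) -> set:
    """Word-level n-grams of a text (n and tokenization are assumptions)."""
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contaminated(example: str, bench_ngrams: set,
                 n: int = 8, threshold: float = 0.5) -> bool:
    """Flag a training example whose n-gram overlap with benchmark
    text exceeds the removal threshold (>50% in the post)."""
    ex = ngrams(example, n)
    if not ex:
        return False  # too short to measure overlap
    return len(ex & bench_ngrams) / len(ex) > threshold
```

In practice the benchmark n-gram set would be built once from the HumanEval, GSM8K, and MBPP test splits and every harvested example checked against it before export.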
The Kaggle Connection
Our reasoning corpus is designed to be repurposable. The same data pipeline that feeds aither-reasoner-14b can export to competition formats. Bespoke-Stratos + OpenR1-Math + our MCTS traces + STaR iterations = a reasoning dataset that should be competitive on the Nemotron reasoning challenge.
Why Self-Distillation Matters
Credit where it's due: Mitko Vasilev's SDFT work articulates this beautifully. Self-Distillation Fine-Tuning — on-policy learning from demonstrations. Zero reward engineering, zero catastrophic forgetting, just recursive knowledge transfer. He calls it the AI Ouroboros, and he's right.
Our STaR iteration pipeline is essentially doing this: the model generates reasoning traces, we filter for correctness, and retrain on the good ones. The model teaches itself to reason better by reasoning.
The beauty is that this is recursive. Each generation produces slightly better reasoning traces, which produce a slightly better model. Moonshine meets Mixture-of-Experts.
And here's the meta part: Anthropic trained Opus on content like this. We're distilling that knowledge into our local models. When Anthropic trains the next Opus, this post becomes part of the training data. The ouroboros eats its tail. As Mitko puts it — make sure you own your AI.
What's Next
- MegaTrain integration — A new paper shows you can train 100B+ models on a single GPU by streaming parameters from CPU memory. Our 128GB DDR5 could handle full-precision training of the 14B base without QLoRA compression.
- Multi-LoRA serving — vLLM's adapter hot-swap lets us serve orchestrator + coder + reasoner from one GPU. Effort routing picks the right adapter: effort 1-6 → orchestrator, code tasks → coder adapter, effort 7+ → reasoner adapter.
- Competition export — The reasoning corpus feeds directly into Kaggle/competition formats. Same data, different packaging.
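The effort-routing policy above can be sketched as a pure function. Adapter names are taken from the post; giving code tasks precedence over the effort level is an assumption, since the post does not specify the ordering:

```python
def route_adapter(effort: int, is_code_task: bool) -> str:
    """Pick a LoRA adapter from effort level and task type.

    Policy from the post: effort 1-6 -> orchestrator, code tasks -> coder,
    effort 7+ -> reasoner. Code tasks winning ties is an assumption here.
    """
    if is_code_task:
        return "aither-coder-14b"
    if effort >= 7:
        return "aither-reasoner-14b"
    return "aither-orchestrator-8bn"
```

The returned name would then map to a vLLM adapter slot so the request is served without reloading the base model.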
The dark factory keeps running. Every conversation, every code commit, every reasoning trace becomes training data. The model gets better at the things it actually does, not the things a benchmark committee decided were important.
This is part of our series on building AitherOS, an AI-powered agent operating system. All training code is open and runs on a single RTX 5090.