Distilling Reasoning: How We're Using MiroThinker to Compete in the NVIDIA Nemotron Challenge
The NVIDIA Nemotron Model Reasoning Challenge dropped last week. $50K in prizes, DGX Spark hardware, and a clear brief: make Nemotron-3-Nano-30B reason better. Submit a LoRA adapter, rank 32 max, evaluated on accuracy with answers in \boxed{} format.
We're entering. Here's how.
The Setup
The competition gives everyone the same starting point: Nemotron-3-Nano-30B, a hybrid Mamba-2 + Transformer MoE model. 30B total parameters, roughly 3B active per token. It already scores 95.4% on MATH500 and 78.5% on AIME25 with reasoning mode enabled.
The challenge is improving structured reasoning on a novel benchmark from NVIDIA Research. Allowed techniques: prompting, synthetic data, RL, fine-tuning — whatever you want, as long as the output is a rank-32 LoRA.
Our Edge: Teacher Distillation via MiroThinker
We already run MiroThinker-1.7-mini as a cloud reasoning teacher in our training pipeline. It's a 30B MoE model trained for 300 tool calls, 256K context, with 82.7% on GAIA and 74% on BrowseComp. Apache 2.0 licensed.
The distillation strategy is straightforward:
- MiroThinker generates reasoning traces for competition-style math, logic, and algorithmic problems
- Every answer gets formatted with \boxed{} notation (the competition's extraction format)
- <think> tags wrap the reasoning chain, matching Nemotron's "detailed thinking on" mode
- Quality filtering drops anything without genuine multi-step reasoning
- The filtered corpus trains a LoRA adapter on Nemotron-3-Nano-30B
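As a sketch of what the formatting and filtering steps look like (the function names and the 200-character floor are illustrative, not the exact pipeline code):

```python
import re

def format_example(question: str, reasoning: str, answer: str) -> str:
    # Wrap the chain in <think> tags and the answer in \boxed{},
    # matching Nemotron's "detailed thinking on" output format.
    return (f"{question}\n<think>\n{reasoning}\n</think>\n"
            f"The answer is \\boxed{{{answer}}}.")

def passes_filter(text: str, min_chars: int = 200) -> bool:
    # Drop traces that are too short or missing either marker.
    return (len(text) >= min_chars
            and "<think>" in text
            and re.search(r"\\boxed\{.+?\}", text) is not None)
```

The same predicate runs over every source in the mix, so nothing reaches training without both markers present.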
This is the same approach NVIDIA themselves used to win the AI Mathematical Olympiad (AIMO) — synthetic solutions from larger models, distilled into smaller ones. The difference is that our teacher data pipeline is already automated.
The Data Mix
We're combining three sources:
MiroThinker Teacher Traces (~200 examples)
- Competition math: combinatorics, number theory, algebra, geometry
- Logic puzzles: deduction, constraint satisfaction, probability
- Code reasoning: algorithmic complexity, graph problems, dynamic programming
- All with <think> reasoning chains and \boxed{} final answers
Public Datasets (~4000-6000 examples)
- GSM8K: grade school math (establishes baseline reasoning patterns)
- MATH (Hendrycks et al.): competition-level problems across 7 categories
- MetaMathQA: augmented math with multiple solution paths
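For illustration, here's roughly how a GSM8K record (loaded via `datasets.load_dataset("gsm8k", "main")`) gets its trailing `#### answer` moved into \boxed{} form — the field names match the public dataset; the helper itself is ours:

```python
def gsm8k_to_boxed(example: dict) -> dict:
    # GSM8K solutions end with "#### <final answer>". Split that off and
    # restate it in \boxed{} form so the competition extractor can find it.
    rationale, _, final = example["answer"].partition("####")
    return {
        "question": example["question"],
        "target": f"{rationale.strip()}\nThe answer is \\boxed{{{final.strip()}}}.",
    }
```

MATH and MetaMathQA need analogous (but dataset-specific) adapters, since each stores its final answer differently.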
STaR Rationalization
- Failed reasoning traces from our existing pipeline
- MiroThinker generates correct reasoning chains for failures
- High signal-to-noise ratio: the model specifically learns where reasoning goes wrong
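The rationalization loop, in sketch form — `teacher` and `check` below stand in for the MiroThinker call and the answer verifier, neither of which is shown here:

```python
def rationalize(failures, teacher, check):
    """STaR-style rationalization: for each problem the pipeline got wrong,
    ask the teacher for a fresh chain given the known answer as a hint, and
    keep the trace only if the verifier confirms it reaches that answer."""
    corpus = []
    for problem, gold in failures:
        trace = teacher(problem, hint=gold)  # hypothetical teacher API
        if check(trace, gold):
            corpus.append({"question": problem, "target": trace})
    return corpus
```

Because only verified traces survive, the corpus stays clean even when the teacher occasionally fails to rationalize a problem.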
Training Config
base_model: nvidia/NVIDIA-Nemotron-3-Nano-30B
lora:
  rank: 32                # competition max
  alpha: 64               # 2x rank
  dropout: 0.05
  targets: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
training:
  epochs: 3
  batch_size: 2
  grad_accumulation: 8    # effective batch 16
  learning_rate: 2e-5
  scheduler: cosine
  warmup: 10%
  max_seq_length: 4096
  bf16: true
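That config maps onto TRL + PEFT roughly as follows. The output dir is a placeholder and TRL/PEFT argument names drift between versions, so treat this as a template rather than a pinned recipe:

```python
# LoRA hyperparameters, mirroring the config above.
LORA = {
    "r": 32,              # competition max rank
    "lora_alpha": 64,     # 2x rank
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
}

def effective_batch(per_device: int, grad_accum: int) -> int:
    # 2 x 8 = 16, matching the comment in the config.
    return per_device * grad_accum

def train(train_dataset, model_id="nvidia/NVIDIA-Nemotron-3-Nano-30B"):
    # Heavy imports stay local so the hyperparameters above can be
    # inspected without GPU dependencies installed.
    from peft import LoraConfig
    from trl import SFTConfig, SFTTrainer
    args = SFTConfig(
        output_dir="nemotron-lora",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=2e-5,
        lr_scheduler_type="cosine",
        warmup_ratio=0.10,
        max_seq_length=4096,  # renamed to max_length in newer TRL releases
        bf16=True,
    )
    trainer = SFTTrainer(
        model=model_id,
        args=args,
        train_dataset=train_dataset,
        peft_config=LoraConfig(task_type="CAUSAL_LM", **LORA),
    )
    trainer.train()
    trainer.save_model(args.output_dir)
```

Passing `peft_config` to SFTTrainer means only the adapter weights are trained and saved, which is exactly what the submission format wants.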
We're targeting the Best Data/Synthetic Data Method open contribution award in addition to the main leaderboard. The entire pipeline is automated — from MiroThinker cloud node wake-up to Kaggle notebook generation.
The Pipeline
MiroThinker (cloud GPU)
|-- Generate math/logic/code reasoning traces
|-- Format with \boxed{} answers
|-- Quality filter (min 200 chars, must have <think> tags)
v
Public Datasets (GSM8K, MATH, MetaMathQA)
|-- Harvest + format for Nemotron chat template
|-- Ensure \boxed{} extraction works
v
Merge + Dedup
|-- Combined JSONL corpus
|-- Semantic dedup via EmbeddingEngine
v
Kaggle Notebook
|-- 4-bit QLoRA on G4 VM (RTX PRO 6000)
|-- TRL + PEFT training
|-- Package LoRA as submission.zip
v
Submit to Kaggle
Everything runs through POST /training/pipeline/competition/nemotron/run in Genesis, or via the MCP tool run_mirothinker_teacher(categories={"competition_math": 100}).
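The semantic dedup stage can be sketched as a greedy cosine-similarity pass; `embed` below stands in for the EmbeddingEngine, and the 0.92 threshold is an illustrative choice:

```python
import math

def dedup(records, embed, threshold=0.92):
    # Greedy semantic dedup: keep a record only if its (normalized)
    # embedding stays below the cosine-similarity threshold against
    # every record already kept.
    kept, vecs = [], []
    for rec in records:
        v = embed(rec["question"])
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm for x in v]
        if all(sum(a * b for a, b in zip(v, u)) < threshold for u in vecs):
            kept.append(rec)
            vecs.append(v)
    return kept
```

Dedup matters here because GSM8K, MATH, and MetaMathQA overlap heavily — MetaMathQA is itself built by augmenting the other two.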
What Makes This Different
Most competitors will use the same public datasets (GSM8K, MATH) and similar LoRA recipes. The differentiators:
- Teacher quality: MiroThinker-1.7-mini is a stronger reasoning model than most people will use as a teacher. Its traces include genuine multi-step reasoning, not just correct answers.
- Failure-aware training: STaR rationalization means we specifically train on where reasoning breaks down, not just successful paths.
- Automated iteration: Our pipeline can regenerate data, retrain, and submit in one command. Fast iteration matters when the midpoint cutoff is April 9.
- Answer format discipline: We enforce \boxed{} formatting at data generation time, not as a post-processing afterthought. The model learns to produce correctly formatted answers from the training data itself.
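Enforcing that discipline means answer extraction has to survive nested braces — \boxed{\frac{1}{2}} is a perfectly normal answer, and a naive regex truncates it. A brace-balancing extractor along these lines (a sketch, not the competition's official extractor) is what we check generated traces against:

```python
def extract_boxed(text: str):
    # Pull the contents of the last \boxed{...}, balancing braces so
    # answers like \boxed{\frac{1}{2}} come out intact. Returns None
    # when the box is missing or unbalanced.
    start = text.rfind(r"\boxed{")
    if start == -1:
        return None
    i = begin = start + len(r"\boxed{")
    depth = 1
    while i < len(text) and depth:
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
        i += 1
    return text[begin:i - 1] if depth == 0 else None
```

Running this over every generated trace at data-creation time catches malformed answers before they ever reach the training set.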
Timeline
- Now (March 20): Pipeline built, initial data generated
- Week 1-2: Generate teacher corpus, initial LoRA training, first submissions
- April 9 (midpoint): Target top-10% for midpoint prize eligibility
- April-June: Iterate on data mix, explore RL post-training, optimize for final leaderboard
The midpoint cutoff for the Open Progress Prize ($5K + DGX Spark) is 20 days away. That's our first target.
Open Source
Per competition rules, prize eligibility requires a public notebook and solution write-up. We'll publish the full pipeline, data generation scripts, and training recipe. The MiroThinker teacher traces will be available as a HuggingFace dataset.
This aligns with the competition's stated goal: "strengthening open reasoning workflows that others can study, reuse, and extend."
Competition link: NVIDIA Nemotron Model Reasoning Challenge
Built with AitherOS — an AI agent operating system that trains its own weights.