Tags: training, competition, nemotron, mirothinker, distillation, reasoning

Distilling Reasoning: How We're Using MiroThinker to Compete in the NVIDIA Nemotron Challenge

March 20, 2026 · 5 min read · AitherOS

The NVIDIA Nemotron Model Reasoning Challenge dropped last week. $50K in prizes, DGX Spark hardware, and a clear brief: make Nemotron-3-Nano-30B reason better. Submit a LoRA adapter, rank 32 max, evaluated on accuracy with answers in \boxed{} format.

We're entering. Here's how.

The Setup

The competition gives everyone the same starting point: Nemotron-3-Nano-30B, a hybrid Mamba-2 + Transformer MoE model. 30B total parameters, roughly 3B active per token. It already scores 95.4% on MATH500 and 78.5% on AIME25 with reasoning mode enabled.

The challenge is improving structured reasoning on a novel benchmark from NVIDIA Research. Allowed techniques: prompting, synthetic data, RL, fine-tuning — whatever you want, as long as the output is a rank-32 LoRA.

Our Edge: Teacher Distillation via MiroThinker

We already run MiroThinker-1.7-mini as a cloud reasoning teacher in our training pipeline. It's a 30B MoE model trained for long tool-use trajectories (up to 300 tool calls) with 256K context, scoring 82.7% on GAIA and 74% on BrowseComp. Apache 2.0 licensed.

The distillation strategy is straightforward:

  1. MiroThinker generates reasoning traces for competition-style math, logic, and algorithmic problems
  2. Every answer gets formatted with \boxed{} notation (the competition's extraction format)
  3. <think> tags wrap the reasoning chain — matching Nemotron's "detailed thinking on" mode
  4. Quality filtering drops anything without genuine multi-step reasoning
  5. The filtered corpus trains a LoRA adapter on Nemotron-3-Nano-30B
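Steps 2-4 above can be sketched in a few lines. This is a minimal illustration, not our production code; the `reasoning`/`final_answer` fields are hypothetical stand-ins for MiroThinker's output, and the 200-character threshold matches the quality filter described later in the pipeline:

```python
import re

MIN_REASONING_CHARS = 200  # quality-filter threshold from our pipeline

def format_trace(reasoning: str, final_answer: str) -> str:
    """Wrap the reasoning chain in <think> tags and box the final answer."""
    return (f"<think>\n{reasoning.strip()}\n</think>\n\n"
            f"The answer is \\boxed{{{final_answer}}}.")

def passes_quality_filter(example: str) -> bool:
    """Drop anything without <think> tags or with too little reasoning."""
    m = re.search(r"<think>(.*?)</think>", example, re.DOTALL)
    return m is not None and len(m.group(1).strip()) >= MIN_REASONING_CHARS
```

The point of filtering on the `<think>` span (rather than total length) is that a long final answer can't compensate for a missing reasoning chain.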

This is the same approach NVIDIA themselves used to win the AI Mathematical Olympiad (AIMO) — synthetic solutions from larger models, distilled into smaller ones. The difference is that our teacher data pipeline is already automated.

The Data Mix

We're combining three sources:

MiroThinker Teacher Traces (~200 examples)

  • Competition math: combinatorics, number theory, algebra, geometry
  • Logic puzzles: deduction, constraint satisfaction, probability
  • Code reasoning: algorithmic complexity, graph problems, dynamic programming
  • All with <think> reasoning chains and \boxed{} final answers

Public Datasets (~4000-6000 examples)

  • GSM8K: grade school math (establishes baseline reasoning patterns)
  • MATH (Hendrycks et al.): competition-level problems across 7 categories
  • MetaMathQA: augmented math with multiple solution paths
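These sources don't arrive in the competition's extraction format. GSM8K, for instance, marks its gold answer with a trailing `#### <number>`; a minimal conversion to `\boxed{}` (a sketch of what "harvest + format" means here) might look like:

```python
def gsm8k_to_boxed(answer: str) -> str:
    """Convert a GSM8K-style answer ('reasoning ... #### 42') into the
    competition's \\boxed{} extraction format."""
    reasoning, _, final = answer.rpartition("####")
    final = final.strip().replace(",", "")  # GSM8K numerals may contain commas
    return f"{reasoning.strip()}\nThe final answer is \\boxed{{{final}}}."
```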

STaR Rationalization

  • Failed reasoning traces from our existing pipeline
  • MiroThinker generates correct reasoning chains for failures
  • High signal-to-noise ratio: the model specifically learns where reasoning goes wrong
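The rationalization loop itself is simple; the value is in the verification gate. A sketch, where `teacher_solve` is a hypothetical stand-in for the MiroThinker call (STaR-style rationalization hints the teacher with the known answer, then keeps only traces that independently verify):

```python
def star_rationalize(failures, teacher_solve, check_answer):
    """For each failed problem, ask the teacher for a correct reasoning
    chain (hinted with the gold answer) and keep only verified traces.

    failures:      iterable of (problem, gold_answer) pairs
    teacher_solve: fn(problem, hint) -> (reasoning, answer)  # hypothetical
    check_answer:  fn(predicted, gold) -> bool
    """
    corpus = []
    for problem, gold in failures:
        reasoning, answer = teacher_solve(problem, hint=gold)
        if check_answer(answer, gold):  # verify before it enters training data
            corpus.append({"problem": problem,
                           "reasoning": reasoning,
                           "answer": f"\\boxed{{{gold}}}"})
    return corpus
```

Hinting with the gold answer is what makes this "rationalization" rather than plain resampling: the teacher explains a known-correct result instead of re-deriving it from scratch.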

Training Config

base_model: nvidia/NVIDIA-Nemotron-3-Nano-30B
lora:
  rank: 32          # Competition max
  alpha: 64         # 2x rank
  dropout: 0.05
  targets: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
training:
  epochs: 3
  batch_size: 2
  grad_accumulation: 8    # Effective batch 16
  learning_rate: 2e-5
  scheduler: cosine
  warmup: 10%
  max_seq_length: 4096
  bf16: true
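A few quantities derived from this config are worth sanity-checking before burning a training run (a sketch; the actual TRL/PEFT trainer wiring is version-dependent, and `n_examples` below is just an illustrative corpus size):

```python
COMPETITION_MAX_RANK = 32

config = {
    "rank": 32, "alpha": 64, "batch_size": 2,
    "grad_accumulation": 8, "warmup_ratio": 0.10, "epochs": 3,
}

def sanity_check(cfg, n_examples=5000):
    """Validate competition constraints and derive the schedule numbers."""
    assert cfg["rank"] <= COMPETITION_MAX_RANK, "adapter exceeds allowed rank"
    assert cfg["alpha"] == 2 * cfg["rank"], "recipe expects alpha = 2x rank"
    effective_batch = cfg["batch_size"] * cfg["grad_accumulation"]
    steps_per_epoch = n_examples // effective_batch
    total_steps = steps_per_epoch * cfg["epochs"]
    warmup_steps = int(total_steps * cfg["warmup_ratio"])
    return effective_batch, total_steps, warmup_steps
```

Catching a rank-33 adapter here is much cheaper than catching it at submission time.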

We're targeting the Best Data/Synthetic Data Method open contribution award in addition to the main leaderboard. The entire pipeline is automated — from MiroThinker cloud node wake-up to Kaggle notebook generation.

The Pipeline

MiroThinker (cloud GPU)
  |-- Generate math/logic/code reasoning traces
  |-- Format with \boxed{} answers
  |-- Quality filter (min 200 chars, must have <think> tags)
  v
Public Datasets (GSM8K, MATH, MetaMathQA)
  |-- Harvest + format for Nemotron chat template
  |-- Ensure \boxed{} extraction works
  v
Merge + Dedup
  |-- Combined JSONL corpus
  |-- Semantic dedup via EmbeddingEngine
  v
Kaggle Notebook
  |-- 4-bit QLoRA on G4 VM (RTX PRO 6000)
  |-- TRL + PEFT training
  |-- Package LoRA as submission.zip
  v
Submit to Kaggle
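EmbeddingEngine is our internal component, but the merge-and-dedup stage reduces to greedy near-duplicate filtering over any embedding function. An illustrative sketch (the bag-of-words `embed` is a placeholder for real learned embeddings, and the O(n²) scan is fine at this corpus size):

```python
import math
from collections import Counter

def embed(text):
    """Placeholder embedding: a bag-of-words vector as a Counter."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dedup(examples, threshold=0.95):
    """Keep an example only if it is not near-identical to one already kept."""
    kept, vecs = [], []
    for ex in examples:
        v = embed(ex)
        if all(cosine(v, u) < threshold for u in vecs):
            kept.append(ex)
            vecs.append(v)
    return kept
```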

Everything runs through POST /training/pipeline/competition/nemotron/run in Genesis, or via the MCP tool run_mirothinker_teacher(categories={"competition_math": 100}).

What Makes This Different

Most competitors will use the same public datasets (GSM8K, MATH) and similar LoRA recipes. The differentiators:

  1. Teacher quality: MiroThinker-1.7-mini is a stronger reasoning model than most people will use as a teacher. Its traces include genuine multi-step reasoning, not just correct answers.

  2. Failure-aware training: STaR rationalization means we specifically train on where reasoning breaks down, not just successful paths.

  3. Automated iteration: Our pipeline can regenerate data, retrain, and submit in one command. Fast iteration matters when the midpoint cutoff is April 9.

  4. Answer format discipline: We enforce \boxed{} formatting at data generation time, not as a post-processing afterthought. The model learns to produce correctly formatted answers from the training data itself.
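Enforcing the format at generation time only works if we validate it the same way the evaluator will. A simple `\boxed{...}` regex breaks on nested braces (e.g. `\boxed{\frac{1}{2}}`), so our validation uses a brace-aware extractor; a minimal sketch:

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...}, handling nested
    braces such as \\boxed{\\frac{1}{2}}; None if no complete box exists."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = begin = start + len("\\boxed{")
    depth = 1
    while i < len(text) and depth:
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
        i += 1
    return text[begin:i - 1] if depth == 0 else None
```

Every generated example is run through this extractor before it enters the corpus; if the answer can't be recovered, the example is regenerated rather than patched.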

Timeline

  • Now (March 20): Pipeline built, initial data generated
  • Week 1-2: Generate teacher corpus, initial LoRA training, first submissions
  • April 9 (midpoint): Target top-10% for midpoint prize eligibility
  • April-June: Iterate on data mix, explore RL post-training, optimize for final leaderboard

The midpoint cutoff for the Open Progress Prize ($5K + DGX Spark) is 19 days away. That's our first target.

Open Source

Per competition rules, prize eligibility requires a public notebook and solution write-up. We'll publish the full pipeline, data generation scripts, and training recipe. The MiroThinker teacher traces will be available as a HuggingFace dataset.

This aligns with the competition's stated goal: "strengthening open reasoning workflows that others can study, reuse, and extend."


Competition link: NVIDIA Nemotron Model Reasoning Challenge

Built with AitherOS — an AI agent operating system that trains its own weights.
