Nemotron-Math: Efficient Long-Context Distillation of Mathematical Reasoning from Multi-Mode Supervision

Wei Du; Shubham Toshniwal; Branislav Kisacanin; Sadegh Mahdavi; Ivan Moshkov; George Armstrong; Stephen Ge; Edgar Minasyan; Feng Chen; Igor Gitman

Published: 17 Jun 2026, Last Modified: 21 Jun 2026 | ICML 2026 AI4Math Workshop Poster | CC BY 4.0

Keywords: Mathematical Reasoning, Synthetic data, Multi-Mode

Abstract: High-quality mathematical reasoning supervision requires diverse reasoning styles, long-form traces, and effective tool integration, capabilities that existing datasets provide only in limited form. Leveraging the multi-mode generation ability of gpt-oss-120b, we introduce Nemotron-Math, a large-scale mathematical reasoning dataset containing 7.5M solution traces across high, medium, and low reasoning modes, each available both with and without Python tool-integrated reasoning (TIR). The dataset integrates 85K curated AoPS problems with 262K community-sourced StackExchange-Math problems, combining structured competition tasks with diverse real-world mathematical queries. We conduct controlled evaluations to assess the dataset's quality. Nemotron-Math consistently outperforms the original OpenMathReasoning on matched AoPS problems, and incorporating StackExchange-Math substantially improves robustness and generalization, especially on HLE-Math, while preserving accuracy on math-competition benchmarks. To support efficient long-context training, we develop a sequential bucketed strategy that accelerates fine-tuning at 128K context length by 2–3× without significant accuracy loss relative to full-length training. To verify the scalability of our supervision, we further run experiments on Qwen3-8B and Qwen3-30B-A3B, showing that both models converge to similar final accuracy under our full-context training recipe. Overall, Nemotron-Math provides diverse, high-quality, and scalable reasoning supervision, enabling state-of-the-art performance, including 100% maj@16 accuracy on AIME 2024/2025 for both Qwen3-8B and Qwen3-30B-A3B with Python TIR.
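
For readers unfamiliar with the maj@16 metric cited in the abstract, the following is a minimal sketch of majority-vote accuracy over k sampled answers per problem. It is not taken from the paper: the function names, the exact-string answer comparison, and the tie-breaking rule (first answer to reach the top count wins) are all assumptions made for illustration, and the authors' evaluation may normalize or compare answers differently.

```python
from collections import Counter


def majority_answer(answers):
    """Return the most frequent non-None answer, or None if there is none.

    Assumption: ties are broken in favor of the answer seen first.
    """
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def maj_at_k(samples_per_problem, references, k=16):
    """Fraction of problems whose majority answer over the first k samples
    matches the reference answer (hypothetical exact-match comparison)."""
    if not references:
        raise ValueError("references must be non-empty")
    correct = 0
    for samples, ref in zip(samples_per_problem, references):
        if majority_answer(samples[:k]) == ref:
            correct += 1
    return correct / len(references)


if __name__ == "__main__":
    # Two toy problems with three sampled answers each.
    samples = [["4", "4", "5"], ["1", "2", "2"]]
    refs = ["4", "1"]
    print(maj_at_k(samples, refs, k=3))
```

In this toy run the first problem's majority answer matches its reference and the second's does not, so the score is one of two problems correct.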

Submission Number: 4
