Keywords: Input-Time Scaling, train-test co-design, reasoning, less-is-more phenomenon, training with random selection and augmentation, exceptional and unexpected math reasoning performance
TL;DR: We introduce the Input-Time Scaling paradigm and the train-test co-design phenomenon. Common intuitions about data quality and quantity may lower performance. SFT with 1k randomly selected and augmented examples achieves 90.0% on AIME24 and 80.0% on AIME25 (32B model).
Abstract: Large Language Models (LLMs) excel at mathematical reasoning, traditionally requiring high-quality data and extensive training. Recent work reveals a **Less-Is-More** phenomenon, where small, curated datasets match resource-intensive approaches. In this work, we systematically investigate quality constraints by adding controlled noise and by comparing datasets of different quality. Noise levels are controlled via the relevance of added contexts to the original queries. Counterintuitively, mixing relevant and irrelevant contexts yields optimal results, and performance gains emerge only when context concatenation is applied consistently (though not necessarily with the same context type) across training and inference. Token distribution analysis shows that persona strategies increase thinking tokens while reducing response length. We term this phenomenon **train-test co-design**. When comparing dataset qualities, high-quality data excels with weaker models and on easier questions, while low-quality data achieves higher overall scores, especially on hard questions with capable models. Building on these insights, we propose a method that applies small, low-quality datasets to capable models via train-test co-design. Because this process differs from standard supervised fine-tuning and test-time scaling, we term it **Input-Time Scaling**. Our method achieves 76.7\% pass@1 on AIME24/AIME25 using Qwen2.5-32B-Instruct, with DeepSeek-R1-Distill-Qwen-32B reaching 90.0\%/80.0\%. We are open-sourcing our datasets, pipelines, evaluation results, and checkpoints to facilitate reproducibility and further research.
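To make the train-test co-design idea concrete, below is a minimal illustrative sketch of input-time augmentation: each query is prepended with a context (e.g., a persona, relevant or irrelevant to math), and the same kind of augmentation is applied to both training examples and test queries. All names here (`PERSONAS`, `augment_query`, `make_training_example`) are hypothetical and do not reflect the authors' released pipeline.

```python
# Sketch of input-time augmentation with train-test co-design (assumed, not the authors' code).
import random

PERSONAS = [
    "You are a competition mathematician who double-checks every step.",  # relevant context
    "You are a travel blogger describing a weekend in Lisbon.",           # irrelevant context
]

def augment_query(query: str) -> str:
    """Prepend a randomly chosen context to the query."""
    persona = random.choice(PERSONAS)
    return f"{persona}\n\nProblem: {query}"

def make_training_example(query: str, solution: str) -> dict:
    """Build an SFT pair whose input uses the same augmentation strategy as inference."""
    return {"input": augment_query(query), "target": solution}

# At inference time the test query is augmented the same way before being sent to the model,
# so training and testing share the input-time transformation.
test_prompt = augment_query("Find the number of ordered pairs (a, b) with a * b = 2024.")
print(test_prompt)
```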
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 1185