Keywords: Large Language Models, Reasoning Diversity, Data Curation
TL;DR: Training with a "one problem, multiple solutions" paradigm increases output diversity and boosts Test-Time Scaling (TTS) performance.
Abstract: While Test-Time Scaling (TTS) effectively enhances the reasoning capabilities of Large Language Models (LLMs), its potential is often bottlenecked by low output diversity. This limitation calls into question the standard one problem, one solution (1P1S) fine-tuning paradigm, which, by rewarding a single canonical answer, may encourage models to overfit to specific reasoning paths. We argue that adopting a one problem, multiple solutions (1PNS) training paradigm is crucial for cultivating reasoning diversity and unlocking the full potential of LLM reasoning. However, a central challenge of this paradigm lies in quantifying the semantic difference between complex, multi-step reasoning paths. To address this, we introduce Reasoning Path Divergence (RPD), a novel, fine-grained metric that operates at the step level of Long Chain-of-Thought solutions. Using RPD, we curate a training set composed of maximally diverse solutions for each problem. Experiments with Qwen3-4B-Base demonstrate that training on our RPD-curated data significantly enhances output diversity and yields substantial gains in pass@k performance. Specifically, our 1PNS approach surpasses the 1P1S baseline by an average of 2.80\% on pass@16 across challenging math benchmarks, with the improvement reaching 4.99\% on AIME24, making Test-Time Scaling more effective.
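The abstract does not spell out how RPD is computed or how the curated set is assembled, so the following is only a minimal illustrative sketch: it assumes RPD is available as a pairwise divergence function over step-segmented solutions and that curation greedily selects a maximally diverse subset per problem. The names `rpd` and `select_diverse_solutions` are hypothetical placeholders, not the authors' implementation.

```python
from itertools import combinations

def rpd(solution_a: list[str], solution_b: list[str]) -> float:
    """Hypothetical placeholder for Reasoning Path Divergence (RPD).

    The paper defines RPD at the step level of Long Chain-of-Thought solutions;
    the exact formula is not given in this abstract, so here we only assume it
    returns a pairwise divergence score (higher = more different reasoning).
    """
    raise NotImplementedError("Replace with the paper's step-level RPD metric.")

def select_diverse_solutions(solutions: list[list[str]], n: int) -> list[list[str]]:
    """Greedily pick n solutions for one problem with large mutual RPD.

    This max-min greedy selection is an assumed reading of 'maximally diverse
    solutions for each problem'; the paper may use a different procedure.
    """
    if len(solutions) <= n:
        return solutions
    # Precompute pairwise divergences between all candidate solutions.
    scores = {(i, j): rpd(solutions[i], solutions[j])
              for i, j in combinations(range(len(solutions)), 2)}
    # Seed with the most divergent pair, then repeatedly add the candidate
    # whose minimum divergence to the already-selected set is largest.
    selected = list(max(scores, key=scores.get))[:n]
    while len(selected) < n:
        remaining = [i for i in range(len(solutions)) if i not in selected]
        best = max(remaining, key=lambda i: min(
            scores[tuple(sorted((i, j)))] for j in selected))
        selected.append(best)
    return [solutions[i] for i in selected]
```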
Primary Area: foundation or frontier models, including LLMs
Submission Number: 9392