Learning from Peers in Reasoning Models

ICLR 2026 Conference Submission 11907 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large Language Models, Large Reasoning Models
TL;DR: To fix the "Prefix Dominance Trap" where a bad start derails LRM reasoning, our LeaP method lets parallel reasoning paths share summaries, significantly improving error correction and allowing models to outperform even larger counterparts.
Abstract: Large Reasoning Models (LRMs) can self-correct even when they make mistakes in their reasoning paths. However, our study reveals that when reasoning starts from a short but poor prefix, the model finds it difficult to recover. We refer to this phenomenon as the *"Prefix Dominance Trap"*; it indicates that the self-correction ability of LRMs is fragile and easily derailed by a poor start. This fragility motivates us to **look beyond internal self-correction**. Inspired by psychological findings that peer interaction can promote correction without negatively impacting already accurate individuals, we propose **Learning from Peers** (LeaP). LeaP lets parallel reasoning paths periodically (every T tokens) summarize their intermediate reasoning and share those summaries through a routing mechanism, so each path incorporates peer insights. Because smaller models may struggle to follow summarization and reflection instructions, we also introduce fine-tuned **LeaP-T** models. Experiments on benchmarks including AIME 2024, AIME 2025, AIMO 2025, and GPQA Diamond demonstrate substantial improvements with LeaP. For example, QwQ-32B with LeaP scores nearly 5 absolute points above its baseline on average and surpasses DeepSeek-R1-671B on three math benchmarks by an average of 3.3 points. The benefits of LeaP also generalize to other domains, such as logic puzzles on the ZebraLogic benchmark. Notably, our fine-tuned LeaP-T-7B matches the performance of DeepSeek-R1-Distill-Qwen-14B on AIME 2024. In-depth analysis shows that LeaP provides robust error correction through timely peer insights and exhibits strong error tolerance. Code will be open-sourced.
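The LeaP loop described above (parallel paths that decode, then every T tokens summarize and exchange summaries via a router) can be sketched as follows. This is a minimal illustrative simulation, not the authors' implementation: `generate_step`, `summarize`, and `route` are hypothetical stand-ins for the LRM decoding step, the summarization prompt, and the paper's routing mechanism.

```python
T = 4  # exchange peer summaries every T generated tokens (tiny for demo)

def generate_step(path_id, context):
    # Hypothetical stand-in for one step of LRM decoding on this path.
    return f"tok{path_id}_{len(context)}"

def summarize(context):
    # Hypothetical stand-in for summarizing a path's recent reasoning.
    return f"summary({context[-1]})"

def route(summaries, path_id):
    # Hypothetical router: hand each path every peer summary except its own.
    # (The actual routing mechanism in the paper may be more selective.)
    return [s for i, s in enumerate(summaries) if i != path_id]

def leap(num_paths=3, total_tokens=8):
    contexts = [[] for _ in range(num_paths)]
    for step in range(1, total_tokens + 1):
        # Each parallel path decodes independently.
        for pid, ctx in enumerate(contexts):
            ctx.append(generate_step(pid, ctx))
        # Periodically, paths summarize and ingest peer summaries.
        if step % T == 0:
            summaries = [summarize(ctx) for ctx in contexts]
            for pid, ctx in enumerate(contexts):
                for peer in route(summaries, pid):
                    ctx.append(f"[peer] {peer}")
    return contexts

paths = leap()
```

With 3 paths, 8 decoding steps, and exchanges at steps 4 and 8, each path's context ends up holding its own 8 tokens plus 2 peer summaries per exchange; in a real system the injected `[peer]` entries are what give a derailed path the outside signal to correct itself.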
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 11907