The Climb Carves Wisdom Deeper Than the Summit: On the Importance of Reasoning Patterns

18 Sept 2025 (modified: 13 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Effective Reasoning Patterns, Post-training, Large language models, Noisy rewards
Abstract: Unlike typical RL studies on verifiable tasks such as math, we investigate the more practical challenge of noisy rewards from non-verifiable, real-world tasks. We begin by artificially injecting noise (flipping rewards) into verifiable tasks (e.g., math and question answering) to gain initial insights. Surprisingly, we find that rewarding a large portion of outputs with incorrect answers does not hinder the acquisition of effective reasoning abilities. We therefore hypothesize that the reasoning process itself is worth rewarding. We validate this hypothesis by rewarding only the appearance of key reasoning phrases, such as "first, I need to", without verifying the correctness of answers; we term this the Reasoning Pattern Reward (RPR). Under this setting, the model achieves peak downstream performance comparable to that of models trained with clean verification rewards. Recognizing the importance of the reasoning process, we develop a core method that uses RPR to calibrate noisy reward models on open-ended NLP tasks. By incorporating RPR, we effectively mitigate potential false negatives in reward signals, thereby enhancing the LLM's reasoning capabilities and evaluation performance on such tasks. Our findings are validated across both the Qwen and Llama model series and provide new insights for advancing post-training techniques.
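As an illustration of the two mechanisms the abstract describes, the sketch below shows (a) reward flipping for noise injection and (b) a phrase-presence reward in the spirit of RPR. This is not the paper's implementation: the phrase list, the fraction-based scoring, and the function names are all assumptions for exposition.

```python
import random

# Illustrative key reasoning phrases; the paper's actual phrase set is not
# specified here, so this list is an assumption.
KEY_PHRASES = [
    "first, i need to",
    "let me think",
    "therefore",
    "to verify",
]

def reasoning_pattern_reward(output: str) -> float:
    """Score an output by the fraction of key reasoning phrases it contains,
    without checking answer correctness (a minimal RPR-style sketch)."""
    text = output.lower()
    hits = sum(phrase in text for phrase in KEY_PHRASES)
    return hits / len(KEY_PHRASES)

def flip_reward(reward: float, noise_rate: float, rng: random.Random) -> float:
    """Inject noise into a binary reward by flipping it with probability
    noise_rate, as in the artificial-noise experiments."""
    return 1.0 - reward if rng.random() < noise_rate else reward
```

For example, an output containing "first, I need to" and "therefore" but none of the other listed phrases would receive a reward of 0.5 under this sketch.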
Primary Area: foundation or frontier models, including LLMs
Submission Number: 11862