Keywords: Model analysis & interpretability, Reasoning, Inference-time Scaling
TL;DR: We analyze key failure modes of reward model-based inference and propose CRISP, an optimized cluster-based, prefix-guided algorithm that improves reasoning accuracy.
Abstract: Inference-time scaling techniques have shown promise in enhancing the reasoning capabilities of large language models (LLMs). 
While recent research has primarily focused on training-time optimization, our work highlights inference-time reward model (RM)-based reasoning as a critical yet overlooked avenue.
In this paper, we conduct a systematic analysis of RM behavior across downstream reasoning tasks, revealing three key limitations: (1) RMs can impair performance on simple questions, (2) their discriminative ability declines as the number of samples increases, and (3) high search diversity undermines RM performance. To address these issues, we propose **CRISP** (Clustered Reward Integration with Stepwise Prefixing), a novel inference-time algorithm that clusters generated reasoning paths by final answer, aggregates reward signals at the cluster level, and adaptively updates prefix prompts to guide generation. Experimental results demonstrate that CRISP significantly enhances LLM reasoning performance, achieving up to **5%** accuracy improvement over other RM-based inference methods and an average **10%** gain over advanced reasoning models.
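The cluster-level reward aggregation described in the abstract can be sketched as follows. This is a simplified, hypothetical illustration: it assumes each sampled reasoning path comes with a final answer and a scalar RM score, clusters paths by final answer, and selects the answer whose cluster has the highest mean reward. The function name `crisp_select`, the mean aggregation, and the example data are assumptions for illustration; the paper's actual aggregation rule and its adaptive prefix-updating step are not shown here.

```python
from collections import defaultdict

def crisp_select(paths):
    """Pick a final answer via cluster-level reward aggregation.

    `paths` is a list of (final_answer, reward_score) pairs, one per
    sampled reasoning path. Paths sharing a final answer form a cluster;
    the answer of the cluster with the highest mean reward is returned.
    (Mean aggregation is an assumption; CRISP's exact rule may differ.)
    """
    clusters = defaultdict(list)
    for answer, reward in paths:
        clusters[answer].append(reward)
    # Aggregate the reward signal at the cluster level, then pick the best cluster.
    return max(clusters, key=lambda a: sum(clusters[a]) / len(clusters[a]))

# Three sampled paths: two agree on "42" with high rewards, one says "41".
paths = [("42", 0.90), ("42", 0.85), ("41", 0.60)]
print(crisp_select(paths))  # → 42
```

Aggregating rewards over answer clusters, rather than ranking individual paths, is one way to blunt the single-path RM failure modes the abstract identifies, since an isolated high-scoring outlier cannot outvote a consistently well-rewarded cluster.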
Supplementary Material: zip
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 16749