ContextPRM: Leveraging Contextual Coherence for Multi-Domain Test-Time Scaling

Published: 26 Jan 2026, Last Modified: 11 Apr 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: Process Reward Models
TL;DR: We shift the training of Process Reward Models from verifying domain-specific correctness to modeling domain-agnostic contextual coherence, achieving state-of-the-art multi-domain generalization.
Abstract: Process reward models (PRMs) have demonstrated significant efficacy in enhancing the mathematical reasoning capabilities of large language models (LLMs) through test-time scaling (TTS). However, while most PRMs yield substantial gains in mathematical domains, the scarcity of domain-specific training data and their knowledge-based learning patterns limit their generalization to other domains. To address this limitation, we shift the learning objective from verifying domain-specific knowledge to modeling domain-agnostic logical flow. Centered on \textit{contextual coherence} between chain-of-thought (CoT) steps, our approach is realized through a novel data annotation and training framework that enhances the model's generalization across diverse domains. For instance, our resulting model, \textbf{ContextPRM}, achieves a notable 6.5\% average accuracy improvement over the majority-voting baseline via weighted majority voting across nine non-mathematical domains in MMLU-Pro, including law, history, and philosophy, significantly surpassing the 2.2\% improvement from VersaPRM and the 0.5\% gains from other mathematics-focused PRMs, and it maintains consistent performance across both mathematical and non-mathematical domains.
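The weighted majority voting used in the evaluation above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `weighted_majority_vote` and the input format are assumptions, and the PRM scoring of each CoT sample is taken as given.

```python
from collections import defaultdict

def weighted_majority_vote(samples):
    """Aggregate sampled final answers, weighting each vote by its PRM score.

    `samples` is a list of (final_answer, prm_score) pairs; how the PRM
    score is produced (e.g. from contextual coherence of CoT steps, as in
    ContextPRM) is outside this sketch.
    """
    totals = defaultdict(float)
    for answer, score in samples:
        totals[answer] += score
    # The answer with the highest total weighted score wins.
    return max(totals, key=totals.get)

# Plain (unweighted) majority voting is the special case where every
# score is 1.0; here the PRM weights overturn the raw vote count.
print(weighted_majority_vote([("A", 0.9), ("B", 0.4), ("B", 0.3)]))  # -> A
```

With uniform weights the two "B" samples would win 2 votes to 1; the PRM scores flip the outcome to "A", which is the mechanism by which a well-calibrated PRM improves on the majority-voting baseline.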
Primary Area: foundation or frontier models, including LLMs
Submission Number: 12539