CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models in Mathematical Reasoning

02 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0
Keywords: Process Reward Model, Length Bias, Mathematical Problem Solving
Abstract: Process Reward Models (PRMs) play a central role in evaluating and guiding multi-step reasoning in large language models (LLMs), especially for mathematical problem solving. However, we identify a pervasive length bias in existing PRMs: a tendency to assign higher scores to more verbose reasoning steps, regardless of their semantic content or logical validity. This bias undermines the reliability of reward predictions and leads to overly verbose outputs during inference. To address this issue, we propose CoLD (Counterfactually-Guided Length Debiasing), a unified framework grounded in counterfactual reasoning and causal graph analysis that mitigates length bias through three components: (1) an explicit length-penalty module, (2) a trainable bias estimator that captures spurious length-related signals, and (3) a joint training strategy that disentangles semantic correctness from superficial length features. Extensive experiments on MATH500 and GSM-Plus show that CoLD consistently reduces reward–length correlation, improves step-selection accuracy, and encourages more concise, logically valid reasoning. These results demonstrate the effectiveness and practicality of CoLD in improving the fidelity and robustness of PRMs.
Supplementary Material: zip
Primary Area: generative models
Submission Number: 986
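
The reward–length correlation named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names (`pearson`, `step_length`, `length_penalized`), the whitespace-token length measure, the linear penalty form, and the coefficient `alpha` are all assumptions made for illustration, since the abstract does not specify them. The sample steps and scores are invented.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def step_length(step):
    """Length of a reasoning step in whitespace tokens (an assumed measure)."""
    return len(step.split())


def length_penalized(reward, length, alpha=0.01):
    """Subtract a linear length penalty from a raw reward (assumed form)."""
    return reward - alpha * length


# Invented example: three reasoning steps with hypothetical PRM scores.
steps = [
    "x = 2",
    "So we add 3 to both sides and get x = 5",
    "Now, carefully considering everything above once more, we can say that x = 5 holds",
]
raw_rewards = [0.60, 0.75, 0.90]

lengths = [step_length(s) for s in steps]
penalized = [length_penalized(r, n) for r, n in zip(raw_rewards, lengths)]

print("raw reward vs. length:      ", pearson(lengths, raw_rewards))
print("penalized reward vs. length:", pearson(lengths, penalized))
```

A correlation near 1 between step length and raw reward would be one symptom of the bias the abstract describes. A fixed linear penalty only rescales the relationship, which is presumably why the abstract pairs the penalty with a learned bias estimator and joint training.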