Keywords: reward modeling, preference alignment, large language models, representation learning
TL;DR: We propose CARP, a framework that improves reward models by aligning responses with prompt intent via inverse prompt prediction. CARP significantly enhances RewardBench accuracy on Gemma-2B/9B models.
Abstract: Reward models (RMs) are central to aligning large language models (LLMs) with human preferences, yet they often overfit to spurious correlations such as response length or sycophancy. Existing approaches mainly focus on mitigating these artifacts, but overlook reinforcing the true causal link from prompt intentions to responses. We propose CARP (Causal Alignment of Reward Models via Response-to-Prompt Prediction), a framework that leverages inverse prompt prediction to measure how well a response addresses the intent embedded in its prompt. A prompt decoder is trained to estimate the original prompt embedding from a given response, and the reconstruction error defines a Semantic Alignment Score (SAS), which we use to adjust preference labels and regularize reward model training. We show theoretically that SAS isolates the prompt-to-response causal signal while filtering out spurious cues. Empirically, the prompt decoder selects shorter and less sycophantic responses with 87.7% accuracy across math, helpfulness, and safety benchmarks. Incorporating SAS into Bradley–Terry reward model training on Gemma-2B-it and Gemma-2-9B-it leads to significant improvements in RewardBench evaluation accuracy, demonstrating CARP’s effectiveness in building more causally aligned reward models.
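To make the abstract's description concrete, the following is a minimal sketch of how an SAS-style score and an SAS-adjusted Bradley–Terry loss could be computed. It assumes pre-computed prompt and response embeddings, a small `prompt_decoder` network, and an additive-margin way of injecting SAS into the preference loss; these names and the margin form are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of a Semantic Alignment Score (SAS) and an SAS-adjusted
# Bradley-Terry preference loss, under the assumptions stated above.
import torch
import torch.nn.functional as F


def semantic_alignment_score(prompt_emb: torch.Tensor,
                             response_emb: torch.Tensor,
                             prompt_decoder: torch.nn.Module) -> torch.Tensor:
    """SAS = negative reconstruction error of the prompt embedding
    predicted back from the response (higher means better aligned)."""
    predicted_prompt_emb = prompt_decoder(response_emb)
    recon_error = F.mse_loss(predicted_prompt_emb, prompt_emb,
                             reduction="none").mean(dim=-1)
    return -recon_error


def carp_bt_loss(r_chosen: torch.Tensor,
                 r_rejected: torch.Tensor,
                 sas_chosen: torch.Tensor,
                 sas_rejected: torch.Tensor,
                 lambda_sas: float = 0.1) -> torch.Tensor:
    """Bradley-Terry loss with an SAS-based margin; the additive-margin
    combination and lambda_sas weight are hypothetical choices."""
    margin = lambda_sas * (sas_chosen - sas_rejected)
    return -F.logsigmoid(r_chosen - r_rejected + margin).mean()
```

In this sketch, a preferred response whose SAS exceeds the rejected one's widens the required reward gap, which is one plausible way to let the alignment signal regularize reward-model training as the abstract describes.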
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 24595