Robust Reward Modeling via Causal Rubrics

Published: 10 Jun 2025, Last Modified: 30 Jun 2025 · MoFA Poster · CC BY 4.0
Keywords: reward modeling, causality
TL;DR: Causal-rubric-guided augmentations improve reward model robustness.
Abstract: Reward models (RMs) for LLM alignment often exhibit reward hacking, mistaking spurious correlates (e.g., length, format) for causal quality drivers (e.g., factuality, relevance), leading to brittle RMs. We introduce CROME (Causally Robust Reward Modeling), a causally grounded framework that uses targeted augmentations to mitigate this. CROME employs: (1) Causal Augmentations, pairs isolating specific causal attribute changes, to enforce sensitivity, and (2) Neutral Augmentations, tie-labeled pairs varying spurious attributes while preserving causal content, to enforce invariance. Crucially, augmentations target LLM-identified causal rubrics, requiring no prior knowledge of spurious factors. CROME significantly outperforms baselines on RewardBench (Avg +5.4%, Safety +13.2%, Reasoning +7.2%) and demonstrates enhanced robustness via improved Best-of-N performance across RewardBench, WildGuardTest, and GSM8k.
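To make the two augmentation types concrete, below is a minimal sketch of how such training pairs might be represented, not the authors' implementation. The `PreferencePair` dataclass and the `llm_rewrite` stub are hypothetical illustrations: `llm_rewrite` stands in for an LLM call that edits an answer per an instruction, and the labels encode the preference (1.0) vs. tie (0.5) targets the abstract describes.

```python
# Hypothetical sketch of CROME-style augmentation pairs (names assumed,
# not from the paper).
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    label: float  # 1.0 = chosen strictly preferred, 0.5 = tie


def llm_rewrite(answer: str, instruction: str) -> str:
    """Stand-in for an LLM editing call; replace with a real API call."""
    raise NotImplementedError


def causal_augmentation(prompt: str, answer: str, rubric: str) -> PreferencePair:
    # Degrade the answer along one LLM-identified causal rubric
    # (e.g., factuality); the original answer is strictly preferred,
    # which trains the RM to be *sensitive* to that attribute.
    degraded = llm_rewrite(answer, f"Introduce a flaw only in: {rubric}.")
    return PreferencePair(prompt, chosen=answer, rejected=degraded, label=1.0)


def neutral_augmentation(prompt: str, answer: str, spurious: str) -> PreferencePair:
    # Vary a spurious attribute (e.g., length, format) while keeping
    # the causal content fixed; the tie label trains the RM to be
    # *invariant* to that attribute.
    restyled = llm_rewrite(answer, f"Change only: {spurious}. Keep all content.")
    return PreferencePair(prompt, chosen=answer, rejected=restyled, label=0.5)
```

In this reading, sensitivity and invariance come from the same mechanism (LLM rewrites) but with opposite labels, which is why no prior inventory of spurious factors is needed.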
Submission Number: 28