Keywords: Reward Modeling, Reward Hacking, Alignment, LLM Post-training, RLHF
TL;DR: We introduce CROME, a novel causality-inspired technique for training reward models for LLM post-training, which achieves significantly improved reward model robustness and reduced reward hacking.
Abstract: Reward models (RMs) are fundamental to aligning Large Language Models (LLMs) via human feedback, yet they often suffer from reward hacking: they latch onto superficial or spurious attributes, such as response length or formatting, mistaking cues learned from correlations in the training data for the true causal drivers of quality (e.g., factuality, relevance). This occurs because standard training objectives struggle to disentangle these factors, leading to brittle RMs and misaligned policies. We introduce CROME (Causally Robust Reward Modeling), a novel framework grounded in an explicit causal model and designed to mitigate reward hacking. CROME first queries an oracle LLM for rubrics that are (or that the oracle deems to be) causally relevant to answering a specific prompt. It then employs two kinds of synthetic, targeted augmentations during training: (1) Causal Augmentations, pairs that differ along specific causal attributes (a subset of the oracle-identified rubrics), which enforce sensitivity along each causal attribute individually, and (2) Neutral Augmentations, tie-label pairs that vary primarily in spurious attributes, which enforce invariance along spurious attributes. Notably, our neutral augmentations are produced without any knowledge of the unknown spurious factors, via question swapping and response interventions along causal rubrics only. We show that the CROME augmentation strategy, using rubrics from popular LLM APIs, significantly outperforms standard baselines on RewardBench, improving average accuracy by up to 5.3% and achieving gains of up to 7.1% and 12.4% on reasoning and safety, respectively. The robustness of CROME is further evidenced by significant gains in DPO-aligned policies and Best-of-N alignment across various benchmarks, including AlpacaEval 2.0, RewardBench, the safety-focused WildGuardTest, and the reasoning-specific GSM8k.
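To make the data-construction step described in the abstract concrete, the sketch below builds the two kinds of augmented preference pairs from a single (prompt, chosen, rejected) example. This is a minimal illustration only: the helpers `get_causal_rubrics` and `degrade_along_rubric` are hypothetical stand-ins for oracle LLM API calls, and the field names and tie-label convention (0.5) are assumptions, not the paper's released implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PreferencePair:
    prompt: str
    response_a: str
    response_b: str
    label: float  # 1.0 -> A preferred, 0.0 -> B preferred, 0.5 -> tie


# --- Hypothetical oracle helpers (stand-ins for LLM API prompts) -----------
def get_causal_rubrics(prompt: str) -> List[str]:
    """Ask an oracle LLM which attributes causally determine answer quality
    for this prompt. Placeholder output shown here."""
    return ["factuality", "relevance", "completeness"]


def degrade_along_rubric(prompt: str, response: str, rubric: str) -> str:
    """Ask the oracle to rewrite `response` so it is worse only along
    `rubric`, leaving other attributes as unchanged as possible."""
    return f"{response} [degraded along: {rubric}]"  # placeholder rewrite


# --- CROME-style augmentations (illustrative) -------------------------------
def causal_augmentations(prompt: str, chosen: str) -> List[PreferencePair]:
    """One pair per causal rubric: the original response vs. a version
    degraded only along that rubric, enforcing sensitivity to each
    causal attribute individually."""
    pairs = []
    for rubric in get_causal_rubrics(prompt):
        worse = degrade_along_rubric(prompt, chosen, rubric)
        pairs.append(PreferencePair(prompt, chosen, worse, label=1.0))
    return pairs


def neutral_augmentations(prompt: str, chosen: str, rejected: str,
                          unrelated_prompt: str) -> List[PreferencePair]:
    """Question swapping: both responses are judged under an unrelated
    prompt, where neither is causally better, so the pair gets a tie
    label. This pushes the RM to ignore surface differences (length,
    formatting, ...) without ever naming the spurious factors."""
    return [PreferencePair(unrelated_prompt, chosen, rejected, label=0.5)]


if __name__ == "__main__":
    prompt = "What causes tides on Earth?"
    chosen = "Tides are mainly caused by the Moon's gravitational pull."
    rejected = "Tides happen because the ocean breathes in and out."
    unrelated = "Write a haiku about autumn."

    augmented = (causal_augmentations(prompt, chosen)
                 + neutral_augmentations(prompt, chosen, rejected, unrelated))
    for pair in augmented:
        print(pair.label, "|", pair.prompt[:30], "|", pair.response_b[:60])
```

In practice the degraded responses would come from an oracle LLM rather than string templates, and the augmented pairs would be mixed with the original preference data when training the reward model.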
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19149