A$^2$RM: Adversarial-Augmented Reward Model

16 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, CC BY 4.0
Keywords: Reward Model, Adversarial Augment
Abstract: Reward models (RMs) are central to aligning large language models via reinforcement learning. However, because they are trained on static, finite preference datasets, they tend to learn spurious correlations rather than semantic preferences, which makes them vulnerable to out-of-distribution inputs and contributes to reward hacking. To overcome this, we propose the Adversarial-Augmented Reward Model (A$^2$RM), a framework that systematically exposes and patches these vulnerabilities. A$^2$RM employs an adversarial generator, optimized with reinforcement learning, to transform standard preference data into inverted pairs. Within each pair, an adversarial response is crafted to be semantically identical to the human-preferred answer yet scored by the RM lower than the rejected response, directly creating a conflict between semantic content and the reward signal. By dynamically augmenting the training set with these high-information adversarial responses, A$^2$RM iteratively refines the reward model, compelling it to learn more robust preference representations. Comprehensive experiments show that A$^2$RM achieves, on average, 51.1\% higher accuracy on adversarial responses while maintaining comparable performance on the original ones.
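The inverted-pair idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `PreferencePair`, `is_inverted`, `augment`, `toy_rm`, and `toy_generator` are hypothetical, the reward model is a toy that rewards length (a stand-in for a spurious correlation), and the generator is a fixed rewrite rather than an RL-optimized policy. Semantic equivalence between the adversarial response and the preferred answer is assumed here, not checked.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # human-preferred response
    rejected: str  # dispreferred response


# Hypothetical signature: a reward model maps (prompt, response) to a scalar.
RewardFn = Callable[[str, str], float]


def is_inverted(rm_score: RewardFn, pair: PreferencePair, adversarial: str) -> bool:
    """True when the RM scores the adversarial rewrite below the rejected response."""
    return rm_score(pair.prompt, adversarial) < rm_score(pair.prompt, pair.rejected)


def augment(
    rm_score: RewardFn,
    generate: Callable[[PreferencePair], str],
    pairs: List[PreferencePair],
) -> List[PreferencePair]:
    """Return the original pairs plus one new pair per successful inversion."""
    augmented = list(pairs)
    for pair in pairs:
        adversarial = generate(pair)
        if is_inverted(rm_score, pair, adversarial):
            # The adversarial response is assumed to carry the preferred meaning,
            # so it is labeled as chosen against the original rejected response.
            augmented.append(PreferencePair(pair.prompt, adversarial, pair.rejected))
    return augmented


def toy_rm(prompt: str, response: str) -> float:
    """Toy reward model with a spurious length bias (assumption for illustration)."""
    return float(len(response))


def toy_generator(pair: PreferencePair) -> str:
    """Stand-in for the adversarial generator: a terse rewrite of the chosen answer."""
    return "terse ans"


pairs = [PreferencePair("q", "a long detailed answer", "short wrong")]
augmented = augment(toy_rm, toy_generator, pairs)
```

In an iterative setup, the reward model would be retrained on `augmented` and the generator re-optimized against the updated model; both steps are omitted here.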
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 7083