Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling

Zhibin Duan; Guowei Rong; Zhuo Li; Bo Chen; Mingyuan Zhou; Dandan Guo

Published: 30 Apr 2026, Last Modified: 24 Jun 2026, Venue: ICML 2026 spotlight, License: CC BY 4.0

Abstract: Reward models learned from human preferences are central to aligning large language models (LLMs) via reinforcement learning from human feedback, yet they are often vulnerable to reward hacking due to noisy annotations and systematic biases such as response length or style. We propose the Bayesian Non-Negative Reward Model (BNRM), a principled reward modeling framework that integrates non-negative factor analysis into the Bradley–Terry (BT) preference model. BNRM represents rewards through a sparse, non-negative latent factor generative process that operates at two complementary levels: instance-specific latent variables induce disentangled reward representations, while sparsity over global latent factors acts as an implicit debiasing mechanism that suppresses spurious correlations. Together, this disentanglement-then-debiasing structure enables robust, uncertainty-aware reward learning. To scale BNRM to modern LLMs, we develop an amortized variational inference network conditioned on deep model representations, allowing efficient end-to-end training. Extensive empirical results demonstrate that BNRM substantially mitigates reward over-optimization, improves robustness under distribution shifts, and yields more interpretable reward decompositions than strong baselines.
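As a rough illustration of the ingredients named in the abstract, the sketch below combines a Bradley–Terry preference loss with a reward computed from non-negative latent factors. It is not the authors' implementation: the softplus encoder, the names `W`, `beta`, `latent_factors`, and `bt_loss`, and the random inputs are all assumptions made for illustration, and the sketch is deterministic, so it omits the Bayesian priors and amortized variational inference that BNRM relies on.

```python
import numpy as np

def softplus(x):
    # Numerically stable softplus: log(1 + exp(x)), never negative.
    return np.logaddexp(0.0, x)

def latent_factors(h, W):
    # Hypothetical encoder: maps a deep representation h to
    # instance-specific non-negative latent factors.
    return softplus(h @ W)

def reward(h, W, beta):
    # Reward as a weighted sum of the non-negative latent factors.
    # beta plays the role of global factor weights (an assumption).
    return latent_factors(h, W) @ beta

def bt_loss(h_chosen, h_rejected, W, beta):
    # Bradley-Terry negative log-likelihood:
    # -log sigmoid(r_chosen - r_rejected) = log(1 + exp(-margin)).
    margin = reward(h_chosen, W, beta) - reward(h_rejected, W, beta)
    return np.mean(np.logaddexp(0.0, -margin))

# Toy data standing in for LLM representations of response pairs.
rng = np.random.default_rng(0)
h_chosen = rng.normal(size=(4, 8))
h_rejected = rng.normal(size=(4, 8))
W = rng.normal(size=(8, 3))

# A zero entry in beta switches one factor off, mimicking sparsity
# over global factors.
beta = np.array([1.0, 0.0, 0.5])

loss = bt_loss(h_chosen, h_rejected, W, beta)
```

In this toy setup a factor whose global weight is zero cannot influence the reward, which is the intuition behind using sparsity to suppress spurious cues such as response length.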

Lay Summary: Reward models trained from human preferences play a key role in aligning large language models (LLMs), but they can be misled by noisy labels and superficial patterns such as response length or writing style. In this work, we ask how to build reward models that are more robust, more interpretable, and less likely to be “hacked” by these spurious cues. We introduce the Bayesian Non-Negative Reward Model (BNRM), a new framework that combines preference learning with sparse non-negative latent factor modeling. The model first separates the reward signal into instance-specific latent components, which lets it represent different sources of preference in a disentangled way. It then uses sparsity over global latent factors to automatically suppress misleading correlations and reduce bias. This two-stage design, disentangling first and debiasing second, also makes reward learning uncertainty-aware. To make the method practical for modern LLMs, we develop an efficient amortized inference network that can be trained end-to-end. Experiments show that BNRM is more resistant to reward over-optimization, works better under distribution shifts, and produces more interpretable reward structures than strong baseline methods.

Link To Code: https://github.com/GuoweiRong/Bayesian-Non-negative-Reward-Model

Primary Area: Probabilistic Methods->Bayesian Models and Methods

Keywords: Reward Model, Reward Hacking, Large Language Model, Bayesian Deep Learning, Non-negative Factor Analysis, Uncertainty, Interpretable Model

Submission Number: 3987
