In-Context Bayesian Reward Modeling for Test-Time Steerability

20 Sept 2025 (modified: 05 Oct 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: reward modeling, large language model, language model post-training
TL;DR: We propose a novel variational reward modeling objective that lets the reward model dynamically learn implicit preference standards, with calibrated confidence, from in-context preference demonstrations.
Abstract: Reward models (RMs) are central to aligning language models via reinforcement learning (RL), yet conventional classifier RMs are static after training and tied to their training data distribution. This limits generalization to unseen preference types, an increasingly salient need with the emergence of verifiable rewards. To address this gap, we introduce **Variational In-Context Reward Modeling (ICRM)**, which casts reward modeling as amortized variational inference over a latent preference probability conditioned on few-shot, in-context preference demonstrations, with a conjugate Beta prior on the Bradley–Terry model. ICRM employs a two-headed regressor that decouples a preference mean ($\mu$) from a confidence factor ($\tau$), jointly parameterizing a Beta posterior given the demonstrations and enabling **test-time steerability of RMs**. On reward model benchmarks, a *fixed* ICRM improves accuracy simply by increasing the number of demonstrations from $1$ to $8$, for instance improving from $63.7\%$ to $95.6\%$ on the Focus subset of RewardBench 2. In a reinforcement learning with verifiable rewards (RLVR) experiment on math reasoning, ICRM used as the reward yields faster and larger accuracy improvements than a verifier-based reward, reaching stronger performance with only $50\%$ of the training problems. Finally, we provide theoretical guarantees that the global minimizer of our loss admits finite confidence and show analytically how KL regularization tempers over-optimization. Together, these results show that ICRM offers a practical and principled framework for a test-time steerable reward model that generalizes beyond the training distribution.
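To make the two-headed parameterization concrete, here is a minimal PyTorch-style sketch of how a preference mean $\mu$ and confidence factor $\tau$ could jointly define a Beta posterior over the latent preference probability. The class name `ICRMHead`, the hidden size, the `encode` pooling step, and the sigmoid/softplus activations are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICRMHead(nn.Module):
    """Illustrative two-headed regressor: maps a pooled context embedding
    (in-context preference demonstrations plus the candidate response pair)
    to a Beta posterior over the latent preference probability."""

    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.mu_head = nn.Linear(hidden_dim, 1)   # preference-mean head
        self.tau_head = nn.Linear(hidden_dim, 1)  # confidence head

    def forward(self, h: torch.Tensor) -> torch.distributions.Beta:
        mu = torch.sigmoid(self.mu_head(h))                 # mean in (0, 1)
        tau = F.softplus(self.tau_head(h)) + 1e-4           # positive confidence
        # Mean/concentration parameterization of a Beta distribution:
        # alpha = mu * tau, beta = (1 - mu) * tau, so E[p] = mu and
        # a larger tau concentrates the posterior more tightly around mu.
        alpha = mu * tau
        beta = (1.0 - mu) * tau
        return torch.distributions.Beta(alpha, beta)

# Hypothetical usage: `h` would come from an encoder pooling k demonstrations
# together with the chosen/rejected pair into a single hidden vector.
# posterior = ICRMHead()(h)
# reward = posterior.mean                                        # point reward for RL
# total_conf = posterior.concentration1 + posterior.concentration0  # recovers tau
```

Under this parameterization, adding more in-context demonstrations only changes the encoder input, so a fixed head can yield sharper (higher-$\tau$) posteriors at test time without retraining.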
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23719