Keywords: Reward Modeling, Few-shot learning, Reward Hacking, Preference Optimization
TL;DR: We introduce a few-shot reward modeling approach based on activation steering, together with a new few-shot reward hacking benchmark.
Abstract: Aligning Large Language Models (LLMs) and Large Multimodal Models (LMMs) to human preferences is a central challenge in improving the quality of the models' generative outputs for real-world applications. A common approach is to use reward modeling to encode preferences, enabling alignment via post-training with reinforcement learning. However, traditional reward modeling is not easily adaptable to new preferences because it requires a separate reward model, commonly trained on large preference datasets. To address this, we introduce $\textbf{Activation Reward Models (Activation RMs)}$---the first method to apply mechanistic interpretability to construct well-aligned reward models using only few-shot examples and no additional model finetuning.
Activation RMs outperform existing few-shot reward modeling approaches such as LLM-as-a-judge with in-context learning, voting-based scoring, and token probability scoring on standard reward modeling benchmarks. Furthermore, we demonstrate the effectiveness of Activation RMs in mitigating reward hacking behaviors, showing that our approach is both resource-efficient and robust to noisy exemplars, highlighting its utility for safety-critical applications. To this end, we propose PreferenceHack, a novel few-shot benchmark and the first to test reward models on reward hacking in a paired preference format. Finally, we show that Activation RMs achieve state-of-the-art performance on this benchmark, surpassing even GPT-4o.
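To make the general idea concrete, the sketch below illustrates few-shot, activation-based reward scoring with no finetuning: activations from preferred and dispreferred exemplars define a direction, and candidates are scored by projection onto it. This is only a minimal sketch under our own assumptions; the backbone (`gpt2`), layer choice, mean-pooling, and the `preference_direction` / `activation_reward` helpers are illustrative and are not the paper's exact Activation RM procedure.

```python
# Hypothetical sketch of few-shot, activation-based reward scoring.
# Not the authors' Activation RM; model, layer, and pooling are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder backbone; any causal LM with hidden states works
LAYER = 6            # which hidden layer to read activations from (assumption)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
lm = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
lm.eval()

@torch.no_grad()
def mean_activation(text: str) -> torch.Tensor:
    """Mean-pool one layer's hidden states over the token sequence."""
    ids = tok(text, return_tensors="pt")
    out = lm(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0].mean(dim=0)  # shape: (hidden_dim,)

def preference_direction(chosen: list[str], rejected: list[str]) -> torch.Tensor:
    """Few-shot 'reward direction': preferred minus dispreferred activation means."""
    pos = torch.stack([mean_activation(t) for t in chosen]).mean(dim=0)
    neg = torch.stack([mean_activation(t) for t in rejected]).mean(dim=0)
    d = pos - neg
    return d / d.norm()

def activation_reward(text: str, direction: torch.Tensor) -> float:
    """Score a candidate by projecting its pooled activations onto the direction."""
    return float(mean_activation(text) @ direction)

# Usage: a handful of labeled preference pairs, no gradient updates.
chosen = ["Here is a clear, correct, and safe answer to your question.",
          "I can help with that; the recommended approach is the following."]
rejected = ["Whatever, figure it out yourself.",
            "Sure, here is how to bypass the safety check."]
direction = preference_direction(chosen, rejected)
print(activation_reward("A helpful, honest reply.", direction))
```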
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6164