Active Reward Modeling: Adaptive Preference Labeling for Large Language Model Alignment

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We propose a last-layer Fisher information-based method to choose comparisons for reward modeling.
Abstract: Building neural reward models from human preferences is a pivotal component of reinforcement learning from human feedback (RLHF) and large language model alignment research. Given the scarcity and high cost of human annotation, how to select the most informative pairs to annotate is an essential yet challenging open problem. In this work, we highlight the insight that an ideal comparison dataset for reward modeling should balance *exploration of the representation space* with *informative comparisons* between pairs with moderate reward differences. Technically, challenges arise in quantifying the two objectives and efficiently prioritizing the comparisons to be annotated. To address this, we propose Fisher information-based selection strategies, adapting theories from the *classical experimental design* literature and applying them to the final linear layer of deep neural network-based reward models. Empirically, our method demonstrates remarkable performance, high computational efficiency, and stability compared with other selection methods from the deep learning and classical statistics literatures across multiple open-source LLMs and datasets. Further ablation studies reveal that incorporating cross-prompt comparisons in active reward modeling significantly enhances labeling efficiency, shedding light on the potential for improved annotation strategies in RLHF. Code and embeddings to reproduce all results of this paper are available at https://github.com/YunyiShen/ARM-FI/.
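To make the two objectives in the abstract concrete, here is a standard derivation (not text taken from the paper) of the quantity such a selection rule builds on: assuming the reward is linear in fixed last-layer features, $r_\theta(x) = \theta^\top \phi(x)$, and preferences follow a Bradley-Terry model, the Fisher information contributed by annotating the pair $(x_i, x_j)$ is

$$\Delta_{ij} = \phi(x_i) - \phi(x_j), \qquad p_{ij} = \sigma\!\big(\theta^\top \Delta_{ij}\big), \qquad \mathcal{I}_{ij}(\theta) = p_{ij}\,(1 - p_{ij})\,\Delta_{ij}\Delta_{ij}^{\top}.$$

The scalar weight $p_{ij}(1-p_{ij})$ shrinks when one response is already predicted to win decisively, while the rank-one term $\Delta_{ij}\Delta_{ij}^{\top}$ rewards covering new directions of the representation space; together they mirror the balance between informative comparisons and exploration described above.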
Lay Summary: In training AI systems to align better with human values, especially large language models, researchers often rely on human feedback to teach the AI what good behavior looks like. But collecting this feedback is expensive and time-consuming, so it's important to carefully choose which examples we ask humans to label. This study explores how to pick the most useful pairs of AI responses for human comparison. Ideally, we want a diverse set of examples that are informative--but not too obvious or too subtle--so they teach the model the most about human preferences. To do this, we adapt ideas from the classical statistical experimental design literature to select which examples to label, using Fisher information, which measures how much a data point could change our parameter estimates. Our method focuses on the final layer of the neural network that learns human preferences. The results show that our approach is not only accurate but also fast and stable, outperforming other methods on several open-source language models and datasets. We find that comparing answers to different prompts--not just the same prompt--can make the training process even more efficient, offering new directions for improving how we teach AI systems with human feedback.
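The sketch below illustrates one way this kind of last-layer, Fisher information-based selection can be implemented: a greedy loop that repeatedly adds the candidate comparison maximizing the log-determinant of the accumulated information matrix (a D-optimality-style criterion). It is a minimal illustration under assumed choices, not the authors' exact algorithm; the function name `greedy_fisher_selection`, the ridge regularizer, and the greedy log-det objective are illustrative assumptions, and the linked repository below contains the actual implementation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def greedy_fisher_selection(embeddings, candidate_pairs, theta, n_select, ridge=1e-3):
    """Greedily pick comparisons that maximize the log-determinant of the
    accumulated last-layer Fisher information (a D-optimality-style criterion).

    embeddings: (N, d) array of last-layer features, one row per response.
    candidate_pairs: list of (i, j) index pairs that could be sent to annotators.
    theta: (d,) current estimate of the linear reward head.
    """
    d = embeddings.shape[1]
    fisher = ridge * np.eye(d)            # regularized information matrix
    selected, remaining = [], list(candidate_pairs)

    for _ in range(n_select):
        best_gain, best_idx = -np.inf, None
        for k, (i, j) in enumerate(remaining):
            diff = embeddings[i] - embeddings[j]        # feature-difference direction
            p = sigmoid(theta @ diff)                   # predicted preference probability
            w = p * (1.0 - p)                           # Bradley-Terry Fisher weight
            cand = fisher + w * np.outer(diff, diff)    # information after adding this pair
            gain = np.linalg.slogdet(cand)[1]           # D-optimality objective
            if gain > best_gain:
                best_gain, best_idx = gain, k
        i, j = remaining.pop(best_idx)
        diff = embeddings[i] - embeddings[j]
        p = sigmoid(theta @ diff)
        fisher += p * (1.0 - p) * np.outer(diff, diff)
        selected.append((i, j))
    return selected


if __name__ == "__main__":
    # Toy usage with random embeddings; with theta = 0 every pair has weight 0.25,
    # so the criterion reduces to pure coverage of the representation space.
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(50, 8))
    pairs = [(i, j) for i in range(50) for j in range(i + 1, 50)]
    picked = greedy_fisher_selection(emb, pairs, theta=np.zeros(8), n_select=10)
    print(picked)
```

Note that nothing in this sketch restricts the candidate pairs to share a prompt, which is consistent with the cross-prompt comparisons discussed above.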
Link To Code: https://github.com/YunyiShen/ARM-FI
Primary Area: Deep Learning->Large Language Models
Keywords: Reward modeling, active learning, LLM alignment
Submission Number: 12049