Interpretable Reward Modeling with Active Concept Bottlenecks

Published: 14 Jun 2025, Last Modified: 19 Jul 2025 · ICML 2025 Workshop PRAL · CC BY 4.0
Keywords: Reward Modeling, Concept Bottleneck Models, Active Learning, Interpretability in RLHF, Acquisition Functions
TL;DR: This paper presents CB-RM, an interpretable reward modeling framework that uses concept bottlenecks and active learning with Expected Information Gain to improve sample efficiency and transparency in RLHF settings.
Track: Short Paper (up to 4 pages)
Abstract: We introduce Concept Bottleneck Reward Models (CB-RM), a reward modeling framework that enables interpretable preference learning through selective concept annotation. Unlike standard RLHF methods that rely on opaque reward functions, CB-RM decomposes reward prediction into human-interpretable concepts. To make this framework efficient in low-supervision settings, we formalize an active learning strategy that dynamically acquires the most informative concept labels. We propose an acquisition function based on Expected Information Gain and show that it significantly accelerates concept learning without compromising preference accuracy. Evaluated on UltraFeedback, our method outperforms baselines in interpretability and sample efficiency, marking a step toward more transparent, auditable, and human-aligned reward models.
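The abstract's active-learning step, selecting which concept label to acquire via Expected Information Gain, can be sketched with a standard BALD-style approximation: the information gained about the model by observing a concept label equals the entropy of the ensemble-averaged prediction minus the average entropy of the individual predictions. This is a minimal illustrative sketch, not the paper's implementation; the function and variable names are hypothetical, and it assumes an ensemble of concept predictors each outputting a Bernoulli probability per concept.

```python
import numpy as np

def binary_entropy(p):
    """Entropy (nats) of a Bernoulli(p), elementwise, safe at p=0 or 1."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def expected_information_gain(concept_probs):
    """BALD-style EIG per concept (a common approximation; the paper's
    exact acquisition function may differ).

    concept_probs: array of shape (n_models, n_concepts), where each row
    is one ensemble member's predicted probability that a concept holds.
    Returns EIG per concept: H(mean prediction) - mean H(prediction).
    """
    mean_p = concept_probs.mean(axis=0)
    total_uncertainty = binary_entropy(mean_p)          # predictive entropy
    expected_data_uncertainty = binary_entropy(concept_probs).mean(axis=0)
    return total_uncertainty - expected_data_uncertainty

def select_concept_to_annotate(concept_probs):
    """Query the concept label with maximal expected information gain."""
    return int(np.argmax(expected_information_gain(concept_probs)))
```

Under this approximation, a concept on which the ensemble members agree (all near 0.9) has near-zero EIG, while a concept on which they disagree (e.g. 0.1 vs. 0.9) has high EIG and is selected for annotation.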
Format: We have read the camera-ready instructions, and our paper is formatted with the provided template.
De-Anonymization: This submission has been de-anonymized.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 16