CLARIFY: Contrastive Preference Reinforcement Learning for Untangling Ambiguous Queries

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-SA 4.0
TL;DR: We propose a contrastive learning method for offline PbRL that integrates preference information into the trajectory embedding space to select unambiguous queries.
Abstract: Preference-based reinforcement learning (PbRL) bypasses explicit reward engineering by inferring reward functions from human preference comparisons, enabling better alignment with human intentions. However, humans often struggle to label a clear preference between similar segments, which reduces label efficiency and limits PbRL's real-world applicability. To address this, we propose an offline PbRL method, Contrastive LeArning for ResolvIng Ambiguous Feedback (CLARIFY), which learns a trajectory embedding space that incorporates preference information so that clearly distinguishable segments are spaced apart, facilitating the selection of less ambiguous queries. Extensive experiments demonstrate that CLARIFY outperforms baselines under both non-ideal teacher and real human feedback settings. Our approach not only selects more clearly distinguishable queries but also learns meaningful trajectory embeddings.
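To make the core idea concrete, the sketch below shows one way a preference-aware contrastive objective over trajectory-segment embeddings could look. This is an illustrative assumption rather than the authors' released implementation: the encoder architecture, the margin value, and the convention of labeling ambiguous pairs with 0.5 are all hypothetical choices introduced here for exposition.

```python
# Minimal sketch (not the authors' code): a contrastive loss over
# trajectory-segment embeddings that spaces apart pairs with a clear
# preference label and keeps ambiguous pairs close, so that far-apart
# pairs can later be selected as unambiguous queries.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajEncoder(nn.Module):
    """Encodes a (T, obs_dim) segment into a fixed-size embedding (assumed architecture)."""
    def __init__(self, obs_dim: int, embed_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # Mean-pool per-step features into one embedding per segment.
        return self.net(segment).mean(dim=-2)

def preference_contrastive_loss(z_a, z_b, label, margin=1.0):
    """Hinge-style contrastive loss: pairs with a clear preference
    (label 0 or 1) are pushed at least `margin` apart; pairs labeled
    ambiguous (label 0.5) are pulled together."""
    dist = F.pairwise_distance(z_a, z_b)
    clear = (label != 0.5).float()
    loss_clear = F.relu(margin - dist) ** 2   # separate clearly preferred pairs
    loss_ambig = dist ** 2                    # collapse ambiguous pairs
    return (clear * loss_clear + (1 - clear) * loss_ambig).mean()

# Toy usage: a batch of 8 segment pairs, each 50 steps of 17-dim observations.
enc = TrajEncoder(obs_dim=17)
seg_a, seg_b = torch.randn(8, 50, 17), torch.randn(8, 50, 17)
labels = torch.tensor([1., 0., 0.5, 1., 0.5, 0., 1., 0.])
loss = preference_contrastive_loss(enc(seg_a), enc(seg_b), labels)
loss.backward()
```

Under this kind of objective, query selection can then favor segment pairs whose embeddings are far apart, on the assumption that such pairs are easier for annotators to compare.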
Lay Summary: When training AI systems with human feedback, a common challenge is that people often find it difficult to choose between similar options. For instance, selecting a favorite painting from two nearly identical artworks can be frustrating and unproductive. Our method, CLARIFY, addresses this by focusing on two key concepts. First, we train the AI to identify meaningful differences between options, like distinguishing whether a robot arm successfully turns a dial or merely waves near it. Second, CLARIFY eliminates ambiguous pairings, akin to a teacher avoiding tricky questions in favor of clear comparisons, thereby enhancing the efficiency and effectiveness of feedback. In our tests, AI systems trained using this approach performed tasks such as opening drawers or walking more effectively, while reducing unclear feedback by 20-40%. This method acts as a "common sense guide" for AI, improving its understanding of human preferences without unnecessary confusion.
Primary Area: Reinforcement Learning
Keywords: Preference-based RL, contrastive learning
Submission Number: 2852