CUDA: Capturing Uncertainty and Diversity in Preference Feedback Augmentation

TMLR Paper6999 Authors

13 Jan 2026 (modified: 05 Jun 2026); Decision pending for TMLR; CC BY 4.0
Abstract: Preference-based Reinforcement Learning (PbRL) effectively addresses reward design challenges in Reinforcement Learning and facilitates human-AI alignment by enabling agents to learn human intentions. However, optimizing PbRL critically depends on abundant, diverse, and accurate human feedback, which is costly and time-consuming to acquire. Existing feedback augmentation methods aim to alleviate this scarcity, but they often neglect diversity, primarily generating feedback for high-confidence trajectory pairs with extreme differences. The result is a biased augmented set that incompletely represents human preferences. To overcome this, we introduce Capturing Uncertainty and Diversity in preference feedback Augmentation (CUDA), a novel approach that considers both uncertainty and diversity. CUDA uses ensemble-based uncertainty estimation for filtering and bucket-based categorization to extract feedback from diverse clusters. Together, these two mechanisms enable CUDA to obtain diverse and accurate augmented feedback. We evaluate CUDA on MetaWorld and DMControl offline datasets, demonstrating significant performance improvements over various offline PbRL algorithms and existing augmentation methods across diverse scenarios.
Submission Type: Regular submission (no more than 12 pages of main content)
Code: https://github.com/lockcept/CUDA
Assigned Action Editor: ~Zhongwen_Xu1
Submission Number: 6999