Pref-GUIDE: Continual Policy Learning from Real-Time Human Feedback via Preference-Based Learning

Zhengran Ji; Boyuan Chen

Published: 06 Oct 2025, Last Modified: 06 Oct 2025. Accepted by TMLR. License: CC BY 4.0

Abstract: Training reinforcement learning agents with human feedback is crucial when task objectives are difficult to specify through dense reward functions. While prior methods rely on offline trajectory comparisons to elicit human preferences, such data is unavailable in online learning scenarios where agents must adapt on the fly. Recent approaches address this by collecting real-time scalar feedback to guide agent behavior and train reward models for continued learning after human feedback becomes unavailable. However, scalar feedback is often noisy and inconsistent, limiting the accuracy and generalization of learned rewards. We propose Pref-GUIDE, a framework that transforms real-time scalar feedback into preference-based data to improve reward model learning for continual policy training. Pref-GUIDE Individual mitigates temporal inconsistency by comparing agent behaviors within short windows and filtering ambiguous feedback. Pref-GUIDE Voting further enhances robustness by aggregating reward models across a population of users to form consensus preferences. Across three challenging environments, Pref-GUIDE significantly outperforms scalar-feedback baselines, with the voting variant exceeding even expert-designed dense rewards. By reframing scalar feedback as structured preferences with population feedback, Pref-GUIDE offers a scalable and principled approach for harnessing human input in online reinforcement learning.
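The two ideas in the abstract (within-window comparison with ambiguity filtering, and population voting) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `window` and `margin` parameters, the choice to drop ambiguous pairs outright, and the simple majority rule are all assumptions made for illustration. See the linked code repository for the actual method.

```python
from itertools import combinations


def scalar_to_preferences(feedback, window=4, margin=0.1):
    """Convert per-step scalar feedback into preference pairs (illustrative).

    Steps are compared only within non-overlapping windows of `window` steps.
    Pairs whose feedback differs by less than `margin` are treated as
    ambiguous and dropped (an assumption; other treatments are possible).
    Returns (i, j, label) tuples: label 1.0 means step i is preferred,
    0.0 means step j is preferred.
    """
    pairs = []
    for start in range(0, len(feedback), window):
        indices = range(start, min(start + window, len(feedback)))
        for i, j in combinations(indices, 2):
            diff = feedback[i] - feedback[j]
            if abs(diff) < margin:
                continue  # ambiguous: feedback values are too close
            pairs.append((i, j, 1.0 if diff > 0 else 0.0))
    return pairs


def consensus_preference(reward_models, a, b):
    """Majority vote over hypothetical per-user reward models.

    Each model is any callable mapping a behavior to a scalar score.
    Returns 1.0 if most models prefer `a`, 0.0 if most prefer `b`,
    and None when the vote is tied.
    """
    votes_a = sum(1 for model in reward_models if model(a) > model(b))
    votes_b = sum(1 for model in reward_models if model(b) > model(a))
    if votes_a == votes_b:
        return None
    return 1.0 if votes_a > votes_b else 0.0
```

The resulting labelled pairs would typically feed a preference-based reward model objective; that training step is omitted here because the abstract does not specify it.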

Submission Length: Regular submission (no more than 12 pages of main content)

Changes Since Last Submission: These are the changes we made for the camera-ready version:
1. We revised Figure 3 to visualize trajectories whose embeddings are close to each other in the embedding space. We also added a sentence addressing the limited human feedback in Bowling, as reviewers suggested.
2. We added a paragraph to the conclusion section discussing the limitations and future directions of the work, as reviewers suggested.
3. In the appendix, we added more visualizations of trajectories whose embeddings are close to each other in the embedding space and showed that their corresponding human feedback varies.
4. We added further experimental details and reported the hardware used during training.

Video: https://www.youtube.com/watch?v=r9Cd7eEdLWE

Code: https://github.com/generalroboticslab/Pref-GUIDE

Assigned Action Editor: ~Matthew_Walter1

Submission Number: 5412
