Video-Based Optimal Transport for Feedback-Efficient Offline Preference-Based Reinforcement Learning

ICLR 2026 Conference Submission 19772 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Preference-based Reinforcement Learning, Offline Reinforcement Learning, Feedback Efficiency, Optimal Transport, Video Foundation Models, Semi-supervised Learning, Robotics
TL;DR: A semi-supervised preference learning method that uses optimal transport over video foundation model embeddings to perform pseudo-labeling.
Abstract: Conveying complex objectives to reinforcement learning (RL) agents often requires meticulous reward engineering. Preference-based RL (PbRL) offers a promising alternative by learning reward functions from human feedback, but its scalability is hindered by the large amount of feedback required. Inspired by recent advances in Video Foundation Models (ViFMs), we present Video-based Optimal Transport Preference (VOTP), a semi-supervised preference learning framework that learns effective reward functions from only a handful of preference labels. By leveraging optimal transport in the representation space of ViFMs for pseudo-labeling, VOTP can exploit large amounts of unlabeled data for reward learning, substantially reducing the need for human supervision. Extensive experiments across locomotion and manipulation tasks show that VOTP outperforms existing PbRL methods under limited feedback. We further validate VOTP on real-robot tasks, demonstrating its ability to learn useful rewards with minimal human input.
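To make the pseudo-labeling idea concrete, below is a minimal sketch (not the authors' implementation) of how unlabeled segment pairs could inherit preference labels via entropic optimal transport in a ViFM embedding space, using the POT library. The function name, the assumption of precomputed pair embeddings, and the soft-label propagation rule are illustrative choices, not details confirmed by the paper.

```python
import numpy as np
import ot  # Python Optimal Transport (POT) library


def ot_pseudo_label(unlabeled_emb, labeled_emb, labeled_prefs, reg=0.05):
    """Propagate preference labels from labeled to unlabeled segment pairs.

    unlabeled_emb: (N, d) ViFM embeddings of unlabeled segment pairs (assumed precomputed)
    labeled_emb:   (M, d) ViFM embeddings of human-labeled segment pairs
    labeled_prefs: (M,)   preference labels in {0, 1} (which segment was preferred)
    """
    # Uniform marginals over the unlabeled and labeled sets.
    a = np.full(len(unlabeled_emb), 1.0 / len(unlabeled_emb))
    b = np.full(len(labeled_emb), 1.0 / len(labeled_emb))

    # Cost matrix: squared Euclidean distance between embeddings.
    M = ot.dist(unlabeled_emb, labeled_emb)
    M = M / M.max()  # normalize for numerical stability

    # Entropic OT (Sinkhorn) yields a soft coupling between the two sets.
    plan = ot.sinkhorn(a, b, M, reg)

    # Each unlabeled pair gets a soft label as a transport-weighted average
    # of the labels of the labeled pairs it is coupled with.
    soft_labels = plan @ labeled_prefs / plan.sum(axis=1)
    return soft_labels  # can be thresholded or used directly as training targets
```

Under this sketch, the resulting soft labels would serve as targets for a standard preference-based reward model (e.g., a Bradley-Terry objective) alongside the small set of human-provided labels.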
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 19772