Keywords: Reward Models, Large Foundation Models, Robot Learning
TL;DR: We investigate how to combine large-scale VLM backbones, progress-based rewards, and preference-based rewards into a single scalable, unified reward model.
Abstract: Learning reward functions from human demonstrations is critical for scalable robot learning, yet most approaches require impractical ground-truth state access or costly online retraining, or yield domain-specific models with poor transferability. We propose SPUR, a unified reward modeling framework that combines a large pre-trained vision-language model (VLM) backbone fine-tuned to encode robot image sequences and language instructions, a progress-based reward objective trained on successful demonstrations augmented with video rewind to simulate failures, and a preference-learning objective over mismatched and rewound trajectories that enables training on failed executions without explicit progress labels. This design leverages the generalization of VLMs while integrating complementary progress and preference signals for improved robustness and generalization. Experiments on out-of-distribution tasks in LIBERO and Meta-World show that each component contributes to performance gains across a set of reward metrics, and that their combination outperforms recent baselines with state-of-the-art results, demonstrating scalable training of reward models.
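To make the two training signals mentioned in the abstract concrete, here is a minimal sketch (not the authors' released code) of how a progress-regression objective and a preference-learning objective could be combined into one loss. All names (combined_reward_loss, lambda_pref, the tensor shapes) are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch: combining a progress-regression loss with a
# Bradley-Terry preference loss, as described at a high level in the
# abstract. Names, shapes, and the weighting term are assumptions.
import torch
import torch.nn.functional as F


def combined_reward_loss(pred_progress, target_progress,
                         reward_preferred, reward_rejected,
                         lambda_pref=1.0):
    """Progress regression on (possibly rewound) demos plus a
    preference loss over preferred/rejected trajectory pairs."""
    # Progress objective: regress predicted task progress in [0, 1]
    # against labels derived from successful demos and rewound clips.
    progress_loss = F.mse_loss(pred_progress, target_progress)

    # Preference objective: the preferred trajectory should receive a
    # higher scalar reward than the rejected (e.g. mismatched or
    # rewound) one; this needs no explicit progress labels.
    pref_loss = -F.logsigmoid(reward_preferred - reward_rejected).mean()

    return progress_loss + lambda_pref * pref_loss


# Toy usage with random tensors standing in for VLM-derived outputs.
if __name__ == "__main__":
    B = 4  # batch of trajectory (segments)
    loss = combined_reward_loss(
        pred_progress=torch.rand(B),
        target_progress=torch.rand(B),
        reward_preferred=torch.randn(B),
        reward_rejected=torch.randn(B),
    )
    print(loss.item())
```

In this sketch the preference term lets failed or mismatched executions contribute training signal even when no progress label is available, which mirrors the motivation stated in the abstract.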
Submission Number: 53