Keywords: robotics, reward modeling, reinforcement learning, datasets, benchmarking, vision-language models
Abstract: A well-designed reward is critical for effective reinforcement learning-based policy improvement. In real-world robotic domains, obtaining such rewards typically requires either labor-intensive human labeling or reliance on brittle hand-crafted objectives. Vision-language models (VLMs) have shown promise as automatic reward models, yet their effectiveness on real-robot tasks remains poorly understood. In this work, we close this gap by introducing (1) \textbf{RoboReward}, a robotics reward dataset and benchmark built on large-scale real-robot corpora from Open X-Embodiment (OXE) and RoboArena, and (2) vision-language reward models trained on this dataset. Because OXE lacks failure examples, we propose counterfactual relabeling, which turns successful episodes into calibrated \emph{negative} and \emph{near-miss} examples for the \emph{same} video. Using this framework, we produce an extensive training and evaluation dataset that spans diverse tasks and embodiments and enables systematic assessment of whether state-of-the-art VLMs can provide reliable rewards for robotics. Our evaluation of leading open-weight and proprietary VLMs reveals that no model excels across all tasks, highlighting substantial room for improvement. We then train 3B- and 7B-parameter models that outperform much larger VLMs at assigning rewards for short-horizon robotic tasks. Finally, we deploy the 3B-parameter reward VLM in real-robot reinforcement learning and find that it improves policy learning over the base 3B model by a large margin.
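The abstract does not spell out the counterfactual relabeling procedure, so the following is only a minimal illustrative sketch of one plausible reading: a successful episode yields a positive example as-is, a negative example by pairing the same video with a mismatched instruction, and a near-miss example by truncating the video before completion. The function name `counterfactual_relabel`, the `distractor_instructions` pool, the `near_miss_cut` fraction, and the 0.5 near-miss reward are all hypothetical choices for illustration, not details taken from the paper.

```python
import random
from dataclasses import dataclass

@dataclass
class RewardExample:
    frames: list        # video frames of the episode (e.g., decoded images)
    instruction: str    # language instruction paired with the video
    reward: float       # target reward label for the (video, instruction) pair

def counterfactual_relabel(frames, instruction, distractor_instructions,
                           near_miss_cut=0.6, seed=0):
    """Hypothetical relabeling of one successful episode into three examples:
    the original positive, a negative with a mismatched instruction, and a
    near-miss built by truncating the video before task completion."""
    rng = random.Random(seed)

    # Positive: the original successful (video, instruction) pair.
    positive = RewardExample(frames, instruction, reward=1.0)

    # Negative: the same video paired with an instruction it does not satisfy.
    wrong_instruction = rng.choice(
        [d for d in distractor_instructions if d != instruction])
    negative = RewardExample(frames, wrong_instruction, reward=0.0)

    # Near-miss: cut the episode short so the correct instruction is
    # attempted but never completed (0.5 is an assumed intermediate label).
    cut = max(1, int(len(frames) * near_miss_cut))
    near_miss = RewardExample(frames[:cut], instruction, reward=0.5)

    return [positive, negative, near_miss]
```

The key property this sketch tries to capture is that all three labels are derived from the same successful video, so no additional robot data or human failure annotation is needed.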
Primary Area: applications to robotics, autonomy, planning
Submission Number: 22010