Keywords: reward modeling, robotics, vla
Abstract: Accurately estimating task progress and deriving robust reward functions from raw video are critical for advancing reinforcement learning (RL) and robotics. While recent Reward Foundation Models (RFMs) have shown promise by fine-tuning Vision-Language Models (VLMs) on robotic datasets, leveraging existing zero-shot VLMs for this task remains difficult due to a significant lack of calibration and a tendency for temporal hallucinations. In this work, we propose SCORE, a novel prompting framework that transforms progress prediction from a black-box logit extraction task into an explicit reasoning-in-language process.
SCORE decomposes the problem into two stages: (1) grounded video description, which ensures the model focuses on task-relevant physical interactions, and (2) semantic progress reasoning, where the VLM jointly predicts a textual completion anchor and a calibrated numerical progress sequence. Our approach effectively closes the performance gap between zero-shot methods and state-of-the-art post-trained RFMs. In offline benchmarks, SCORE outperforms existing baselines in trajectory ranking and cross-task calibration. Furthermore, we demonstrate the real-world utility of SCORE by using it as a reward signal for Diffusion Steering RL (DSRL); our method enables a Vision-Language-Action (VLA) policy to overcome strong initial biases, achieving a +90% success rate improvement over vanilla policies. Finally, we provide an empirical scaling analysis showing that progress prediction capabilities improve significantly with each new generation of frontier VLMs, positioning SCORE as a scalable, high-performance solution for zero-shot reward modeling.
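The two-stage decomposition described in the abstract can be sketched as a prompting pipeline. The snippet below is a minimal illustration, not the paper's implementation: the prompt wording, the JSON output format, and the helper names (`build_score_prompts`, `parse_progress`) are assumptions introduced for illustration, and the actual VLM call is left out.

```python
import json
import re


def build_score_prompts(task: str, num_frames: int) -> tuple[str, str]:
    """Hypothetical two-stage prompts in the spirit of SCORE."""
    # Stage 1: grounded video description -- steer the VLM toward
    # task-relevant physical interactions rather than scene trivia.
    describe = (
        f"You are watching a robot attempt the task: '{task}'. "
        f"For each of the {num_frames} frames, describe only the "
        "physical interactions relevant to completing the task."
    )
    # Stage 2: semantic progress reasoning -- request a textual
    # completion anchor plus a calibrated per-frame progress value.
    reason = (
        "Based on your descriptions, first state in one sentence what "
        "full task completion looks like (the completion anchor). Then "
        "output a JSON object with per-frame progress values in [0, 1], "
        'e.g. {"anchor": "...", "progress": [0.0, 0.25, ...]}.'
    )
    return describe, reason


def parse_progress(reply: str, num_frames: int) -> list[float]:
    """Extract and validate the numerical progress sequence from a reply."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in VLM reply")
    data = json.loads(match.group(0))
    progress = [float(p) for p in data["progress"]]
    if len(progress) != num_frames:
        raise ValueError("progress length does not match frame count")
    if any(not 0.0 <= p <= 1.0 for p in progress):
        raise ValueError("progress values must lie in [0, 1]")
    return progress
```

Reasoning in language before emitting numbers, and then validating the parsed sequence, is what replaces black-box logit extraction in this framing: the progress values arrive as explicit, checkable text rather than as token probabilities.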
Submission Number: 19