Keywords: Imitation Learning, Reward Modeling, Robotics Manipulation
Abstract: Large-scale robot learning has made progress on complex manipulation tasks, yet long-horizon, contact-rich problems—especially those involving deformable objects—remain challenging due to inconsistent demonstration quality. We propose a stage-aware, video-based reward modeling framework that jointly predicts task stage and fine-grained progress, using natural-language subtask annotations to derive consistent labels across variable-length demonstrations. This avoids the brittleness of frame-index-based labeling and provides stable supervision even in tasks like T-shirt folding. Our reward model is robust to demonstration variability, generalizes to out-of-distribution scenarios, and improves downstream policy training. Building on it, we introduce Reward-Aligned Behavior Cloning (RA-BC), which filters and reweights demonstrations based on reward estimates. Experiments show that our method significantly outperforms baselines in both real-world rollouts and human validation. On T-shirt folding, we achieve 83% success from the flattened state and 67% from the crumpled state, compared to 8% and 0% with vanilla BC. Overall, our results highlight reward modeling as a scalable and annotation-efficient solution for long-horizon robotic manipulation. Project website: https://qianzhong-chen.github.io/sarm.github.io/.
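The abstract describes RA-BC as filtering and reweighting demonstrations using the learned reward model. Below is a minimal illustrative sketch of one such scheme, not the authors' implementation: per-demo scores from a reward model are used to drop low-quality demonstrations and to weight the rest for behavior cloning. All names (`score_demo`, `build_weighted_dataset`, `THRESHOLD`) and the averaging/thresholding choices are assumptions for illustration.

```python
import numpy as np

THRESHOLD = 0.5  # assumed quality cutoff on the reward model's progress score

def score_demo(reward_model, frames):
    """Average the reward model's per-frame progress estimates over a demo (assumed scoring rule)."""
    scores = [reward_model(frame) for frame in frames]
    return float(np.mean(scores))

def build_weighted_dataset(demos, reward_model):
    """Drop low-scoring demos and weight the rest by their reward estimate."""
    kept, weights = [], []
    for demo in demos:
        s = score_demo(reward_model, demo["frames"])
        if s < THRESHOLD:
            continue  # filter out demonstrations judged low-quality by the reward model
        kept.append(demo)
        weights.append(s)
    # Normalize so the weights can serve as sample weights in a BC loss.
    w = np.asarray(weights)
    return kept, w / w.sum()
```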
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 3609