Steerable Video Action Model

ICLR 2026 Conference Submission 200 Authors

01 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: video generation, robot manipulation
Abstract: Steerable robot policies—those conditioned on steering signals such as trajectory traces—offer a promising path toward flexible, general-purpose robot control. However, most existing steerable policies are limited by their reliance on action-labeled robot data for learning to follow these steering signals. Recently proposed video-action models offer a scalable way to incorporate additional video data by learning to jointly predict future video frames and actions, enabling rich latent representations that capture visual dynamics and improve action prediction. Despite their promise, prior video-action models are not steerable, limiting their ability to generalize to out-of-distribution task specifications or novel object configurations that require new behaviors. We propose the Steerable Video Action (SVA) model, which learns to jointly predict future video frames and low-level actions while receiving guidance from end-effector trajectory traces as steering signals. To process these traces, we represent them as images, encode them with a pretrained VAE, and explicitly align the encoded tokens spatially with the visual observation tokens before passing them through a transformer. We find that SVA can incorporate guidance from end-effector trajectory traces and generalizes better to unseen traces, outperforming baselines both with and without access to trajectory traces.
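To make the trace-conditioning idea concrete, below is a minimal, hypothetical PyTorch sketch of the pipeline the abstract describes: rasterize an end-effector trace into an image, encode it with a (stand-in) pretrained VAE encoder, spatially align the trace tokens with the observation tokens, and fuse them in a transformer. This is not the authors' implementation; it omits video prediction and predicts only an action, and all module names, token-fusion choices, and dimensions are assumptions for illustration.

```python
# Hypothetical sketch of trace-conditioned action prediction (not the SVA code).
import torch
import torch.nn as nn


def rasterize_trace(trace_xy, image_size=64):
    """Draw a 2D end-effector trace (N, 2) with coords in [0, 1] onto a 1-channel image."""
    img = torch.zeros(1, image_size, image_size)
    px = (trace_xy.clamp(0, 1) * (image_size - 1)).long()
    img[0, px[:, 1], px[:, 0]] = 1.0
    return img


class ToyVAEEncoder(nn.Module):
    """Stand-in for a pretrained VAE encoder: image -> grid of latent tokens."""
    def __init__(self, in_ch, dim=128, patch=8):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                      # x: (B, C, H, W)
        z = self.proj(x)                       # (B, dim, H/patch, W/patch)
        return z.flatten(2).transpose(1, 2)    # (B, num_tokens, dim)


class SteerableVideoActionSketch(nn.Module):
    """Fuses spatially aligned observation and trace tokens; predicts an action."""
    def __init__(self, dim=128, action_dim=7):
        super().__init__()
        self.obs_enc = ToyVAEEncoder(in_ch=3, dim=dim)
        self.trace_enc = ToyVAEEncoder(in_ch=1, dim=dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(dim, action_dim)

    def forward(self, obs_img, trace_img):
        obs_tokens = self.obs_enc(obs_img)        # (B, T, dim)
        trace_tokens = self.trace_enc(trace_img)  # (B, T, dim), same grid layout
        # Spatial alignment assumption: token i of the trace covers the same image
        # patch as token i of the observation, so the two can be fused elementwise.
        fused = self.transformer(obs_tokens + trace_tokens)
        return self.action_head(fused.mean(dim=1))  # (B, action_dim)


if __name__ == "__main__":
    trace = rasterize_trace(torch.rand(50, 2)).unsqueeze(0)  # (1, 1, 64, 64)
    obs = torch.rand(1, 3, 64, 64)
    model = SteerableVideoActionSketch()
    print(model(obs, trace).shape)                           # torch.Size([1, 7])
```

The elementwise fusion of trace and observation tokens is just one way to exploit the spatial alignment mentioned in the abstract; channel-wise concatenation or interleaving the two token sequences would be equally plausible readings.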
Primary Area: applications to robotics, autonomy, planning
Submission Number: 200