Inference-Time Planning with Action-Conditioned Video Models for Generalizable Robot Manipulation

Published: 26 May 2026, Last Modified: 27 May 2026, Real2Sim2Real, Everyone, CC BY 4.0
Reviewer: ~Ola_Sho1
Keywords: action-conditioned video prediction, uncertainty quantification, robot manipulation, robot learning
TL;DR: We propose a novel inference-time planning framework that uses action-conditioned video models as high-fidelity world models for robotic manipulation to improve generalization of behavior-cloned policies.
Abstract: We propose **VidPlan**, an inference-time planning framework that uses action-conditioned video models as high-fidelity world models for robot manipulation. In contrast to state-of-the-art (SOTA) generalist policies learned via behavior cloning, a world modeling recipe enables counterfactual reasoning with context-specific information at inference time, leading to stronger zero-shot generalization. Moreover, world models can enable the optimization of action sequences that maximize predicted rewards. However, we demonstrate that naively utilizing a video model for inference-time optimization can result in *world model hacking*, where optimized action sequences exploit hallucinations in the world model without leading to successful real-world executions. To mitigate this challenge, we develop a novel *uncertainty quantification* method that constrains action optimization to high-confidence regions. Further, we introduce a novel *hierarchical reward prediction* model to capture both semantic and fine-grained task progress, designed within a *flow map framework* for real-time optimization of video plans. Through extensive experiments on manipulation tasks, we demonstrate that VidPlan achieves improved generalization, higher task success rates, and stronger instruction following compared to SOTA vision-language-action baselines.
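To make the idea of constraining action optimization to high-confidence regions concrete, here is a minimal sketch. It is not the VidPlan method: it assumes a simple random-shooting planner, and the names `plan_with_uncertainty_penalty`, `toy_reward`, and `toy_uncertainty` are hypothetical stand-ins for a learned reward predictor and an uncertainty estimate. The toy reward deliberately grows with action magnitude to mimic a world model that can be "hacked", while the toy uncertainty grows once actions leave an assumed trusted range of [-0.5, 0.5].

```python
import numpy as np


def plan_with_uncertainty_penalty(predict_reward, predict_uncertainty,
                                  horizon, action_dim,
                                  num_samples=256, penalty=1.0, seed=0):
    """Random-shooting planner with an uncertainty penalty.

    Hypothetical helper for illustration only; the paper's optimizer and
    uncertainty estimator are not specified in the abstract.
    """
    rng = np.random.default_rng(seed)
    # Sample candidate action sequences uniformly in [-1, 1].
    candidates = rng.uniform(-1.0, 1.0, size=(num_samples, horizon, action_dim))
    rewards = np.array([predict_reward(a) for a in candidates])
    uncertainties = np.array([predict_uncertainty(a) for a in candidates])
    # Penalize candidates whose predicted outcome the model is unsure about.
    scores = rewards - penalty * uncertainties
    best = int(np.argmax(scores))
    return candidates[best], float(rewards[best]), float(uncertainties[best])


def toy_reward(actions):
    # Assumed "hackable" model: predicted reward keeps growing with magnitude.
    return float(np.abs(actions).sum())


def toy_uncertainty(actions):
    # Assumed uncertainty: zero inside [-0.5, 0.5], growing outside it.
    return float(np.maximum(np.abs(actions) - 0.5, 0.0).mean())


# penalty=0.0 reproduces naive reward maximization over the same candidates.
naive_plan, naive_r, naive_u = plan_with_uncertainty_penalty(
    toy_reward, toy_uncertainty, horizon=4, action_dim=2, penalty=0.0)
safe_plan, safe_r, safe_u = plan_with_uncertainty_penalty(
    toy_reward, toy_uncertainty, horizon=4, action_dim=2, penalty=50.0)

print("naive: reward", naive_r, "uncertainty", naive_u)
print("safe:  reward", safe_r, "uncertainty", safe_u)
```

Because both calls use the same seed, they score the same candidate set, so the penalized plan never has higher estimated uncertainty than the naive one. It gives up some predicted reward in exchange for staying closer to the region the toy model trusts.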
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 6