Inference-Time Planning with Action-Conditioned Video Models for Generalizable Robot Manipulation
Reviewer: ~Ola_Sho1
Keywords: action-conditioned video prediction, uncertainty quantification, robot manipulation, robot learning
TL;DR: We propose a novel inference-time planning framework that uses action-conditioned video models as high-fidelity world models for robotic manipulation to improve generalization of behavior-cloned policies.
Abstract: We propose **VidPlan**, an inference-time planning framework that uses action-conditioned video models as high-fidelity world models for robot manipulation. In contrast to state-of-the-art (SOTA) generalist policies learned via behavior cloning, a world modeling recipe enables counterfactual reasoning with context-specific information at inference time, leading to stronger zero-shot generalization. Moreover, world models can enable the optimization of action sequences that maximize predicted rewards. However, we demonstrate that naively utilizing a video model for inference-time optimization can result in *world model hacking*, where optimized action sequences exploit hallucinations in the world model without leading to successful real-world executions. To mitigate this challenge, we develop a novel *uncertainty quantification* method that constrains action optimization to high-confidence regions. Further, we introduce a novel *hierarchical reward prediction* model to capture both semantic and fine-grained task progress, designed within a *flow map framework* for real-time optimization of video plans. Through extensive experiments on manipulation tasks, we demonstrate that VidPlan achieves improved generalization, higher task success rates, and stronger instruction following compared to SOTA vision-language-action baselines.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
PDF: pdf
Submission Number: 6