SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models
Reviewer: ~Haowen_Liu2, ~Shaoxiong_Yao1
Keywords: Simulation, Robotic Manipulation, Vision-Language Models
TL;DR: We present SIMPACT, a training-free test-time framework that improves VLM action planning by using simulation-in-the-loop physical reasoning from a single RGB-D observation.
Abstract: Vision-Language Models (VLMs) exhibit remarkable common-sense and semantic reasoning capabilities.
However, they lack a grounded understanding of physical dynamics.
This limitation arises because VLMs are trained on static, internet-scale vision-language data that contain no causal interactions or action-conditioned changes.
Consequently, it remains challenging to leverage VLMs for fine-grained robotic manipulation tasks that require physical understanding, reasoning, and corresponding action planning.
To overcome this, we present $\textbf{SIMPACT}$, a test-time, $\textbf{SIM}$ulation-enabled $\textbf{ACT}$ion $\textbf{P}$lanning framework that equips VLMs with physical reasoning through simulation-in-the-loop world modeling, without requiring any additional training.
From a single RGB-D observation, SIMPACT efficiently constructs physics simulations, enabling the VLM to propose informed actions, observe simulated rollouts, and iteratively refine its reasoning.
By integrating language reasoning with physics prediction, our simulation-enabled VLM can understand contact dynamics and action outcomes in a physically grounded way.
Our method demonstrates state-of-the-art performance on seven challenging, real-world rigid-body and deformable manipulation tasks that require fine-grained physical reasoning, outperforming existing general-purpose robotic manipulation models.
Our results demonstrate that embedding physics understanding via efficient simulation into VLM reasoning at test time offers a promising path towards generalizable embodied intelligence.
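The propose, simulate, and refine loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the one-dimensional sliding-block simulator, the `propose_action` stand-in for the VLM, and every name and parameter below are hypothetical, chosen only to show how simulated rollouts can feed back into action proposals at test time.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rollout:
    action: float          # push speed given to the block (m/s)
    final_position: float  # where the block comes to rest (m)


def simulate_push(push_speed: float, friction: float = 0.5, g: float = 9.81) -> Rollout:
    """Toy simulator (assumption): a block launched at push_speed slides
    on a flat surface until Coulomb friction brings it to rest."""
    distance = push_speed ** 2 / (2.0 * friction * g)
    return Rollout(action=push_speed, final_position=distance)


def propose_action(goal: float, previous: Optional[Rollout]) -> float:
    """Hypothetical stand-in for the VLM: make an initial guess, then
    refine it using the outcome of the previous simulated rollout."""
    if previous is None:
        return 1.0  # first proposal, made without any physical feedback
    # Damped correction: push harder if the block stopped short, softer if it overshot.
    return previous.action + 0.5 * (goal - previous.final_position)


def plan(
    goal: float,
    simulate: Callable[[float], Rollout] = simulate_push,
    tolerance: float = 0.01,
    max_iters: int = 50,
) -> Rollout:
    """Alternate between proposing an action and simulating it until the
    simulated outcome is within tolerance of the goal or the budget runs out."""
    rollout: Optional[Rollout] = None
    for _ in range(max_iters):
        action = propose_action(goal, rollout)
        rollout = simulate(action)
        if abs(rollout.final_position - goal) <= tolerance:
            break
    assert rollout is not None  # holds whenever max_iters >= 1
    return rollout


if __name__ == "__main__":
    best = plan(goal=1.0)
    print(f"push speed {best.action:.3f} m/s -> rest position {best.final_position:.3f} m")
```

In this sketch the proposer never sees the simulator's equations, only its rollouts, which mirrors the idea of refining actions from simulated outcomes rather than from additional training.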
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 35