Keywords: Multi-modal Language Models, Spatial and Temporal Perception, Intuitive Physics, Cognitive Science
Abstract: In recent years, multi-modal Vision-Language Models (VLMs) have improved substantially in their ability to generate realistic images. This raises important questions about what kind of representation these models hold of the world, in particular how they represent physical objects and their motion over time. We adopt an experimental paradigm from prior work in cognitive science to study physical reasoning. To improve the physical simulation ability of VLMs, we propose a novel method inspired by in-context reasoning and the psychology of mental simulation, which we call Chain-of-Time simulation. In our experiments, we find that a state-of-the-art VLM can simulate physical scenes forward in time, but with substantial errors. Performance improves markedly when Chain-of-Time simulation is used, and in this case we also observe a human-like bias in which simulated motion slows down the longer the simulation runs.
Submission Number: 28