Chain of Time: In-Context Physical Simulation with Image Generation Models

ICLR 2026 Conference Submission 23193 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Multi-modal Language Models, Spatial and Temporal Perception, Image Generation, Physical Reasoning
Abstract: We propose a novel method to improve the physical simulation ability of vision-language models. This Chain-of-Time simulation is motivated by in-context reasoning in machine learning and by mental simulation in humans. The method involves generating a series of intermediate images during a simulation. Chain of Time is applied at inference time and requires no additional fine-tuning to yield performance benefits. We apply the Chain-of-Time method to synthetic and real-world domains, including 2-D graphics simulations and natural 3-D videos. These domains test a variety of physical properties, including velocity, acceleration, fluid dynamics, and conservation of momentum. We find that Chain-of-Time simulation substantially improves the performance of a state-of-the-art Image Generation Model. Beyond examining performance, we also analyze the specific states of the world simulated by an image model at each time step, which sheds light on the dynamics underlying these simulations. This analysis reveals insights that are hidden from traditional evaluations of physical reasoning, including cases where an Image Generation Model simulates physical properties that unfold over time, such as velocity, gravity, and collisions, well. Our analysis also highlights cases where the Image Generation Model struggles to infer particular physical parameters from input images, despite being capable of simulating the relevant physical processes.
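
The abstract describes an inference-time rollout: rather than asking the model for the final state of a scene in one step, each generation advances the simulation by a small time step, and the generated frame joins the context for the next step. Below is a minimal sketch of that loop, assuming a hypothetical `generate_next_frame` wrapper around an image generation model; the function name, signature, and `Frame` type are illustrative and not the authors' API.

```python
from typing import Callable, List, TypeVar

Frame = TypeVar("Frame")  # e.g. a PIL image or a numpy array

def chain_of_time_rollout(
    initial_frames: List[Frame],
    generate_next_frame: Callable[[List[Frame]], Frame],
    num_steps: int,
) -> List[Frame]:
    """Roll a physical simulation forward by generating intermediate frames.

    Instead of asking the model for the final state directly, each call
    advances the scene by one small time step, and the generated frame is
    appended to the context for the next call (analogous to chain-of-thought
    reasoning, but unfolding over simulated time).
    """
    frames = list(initial_frames)
    for _ in range(num_steps):
        # Condition on all frames so far, so the model can infer dynamics
        # such as velocity from consecutive images rather than a single one.
        frames.append(generate_next_frame(frames))
    return frames
```

In this sketch, a direct-prediction baseline would correspond to a single call spanning the full time horizon, whereas the Chain-of-Time variant takes `num_steps` small steps, requiring no fine-tuning of the underlying model.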
Primary Area: interpretability and explainable AI
Submission Number: 23193