Can vision language models learn intuitive physics from interaction?

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Vision language models, Intuitive physics, Interaction, Cognitive Science, Computational Cognitive Science, Human-like machine learning
Abstract: Pre-trained vision language models lack reliable intuitions about the physical world. Recent work has shown that supervised fine-tuning can improve model performance on simple physical tasks; however, fine-tuned models do not appear to learn robust physical rules that generalize to new contexts. Drawing on research in cognitive science, we hypothesize that models need to interact with an environment to properly learn its physical dynamics. We train models that learn through interaction with the environment using reinforcement learning, as well as models that learn without interaction using supervised fine-tuning. While both reinforcement learning and supervised fine-tuning appear to improve within-task performance, neither produces models with generalizable physical intuitions: models trained on one task do not reliably generalize to related tasks, even when those tasks share visual statistics and physical principles, and regardless of whether training involves interaction.
Primary Area: applications to neuroscience & cognitive science
Submission Number: 24501