Keywords: Vision Language Models
Abstract: Perception enables humans and animals to interact with the world.
We combine perception with our knowledge of physical laws to complete many tasks, such as tracking a ball or driving a car.
While we learn these physical laws through perception, proprioception, and physical interaction, Vision Language Models (VLMs) have seemingly learned the same physical laws strictly through natural language.
In this paper, we repurpose existing puzzles designed to test physical problem solving in humans to evaluate VLMs instead. Unlike prior work that evaluates only artificial agents, we use games designed for and tested on humans.
We design a new reinforcement learning environment on top of the original game that allows VLMs to play each level over multiple rounds.
Comparing humans and VLMs directly on the same 28 levels with the same number of attempts, we show that humans average a 61% pass rate within 4 attempts, while the best-performing VLM without access to code representing the game state solves only 9/28 levels. Only Gemini 3 Pro, given a textual JSON representation of the game state, exceeds human-level performance (23/28), revealing that physical reasoning is unlocked through text, not vision.
Our results suggest that while VLMs can describe Newton's Laws of Motion in detail, they cannot apply these principles to solve challenging physical problems that humans can often solve in just a few attempts.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 31