Keywords: Large Language Models, Physical Understanding, Benchmark, Reinforcement Learning
TL;DR: We introduce SimuPhy, a benchmark dataset and a closed-loop RL framework that teaches LLMs to understand and simulate physical processes by linking code, video, and VLM-based validation.
Abstract: Large language models (LLMs) have achieved remarkable progress in mathematics and code generation, yet their ability to reason about the physical world remains underexplored. Unlike mathematical reasoning, which can be expressed symbolically in text, physical reasoning is inherently tied to motion and dynamic processes. In this paper, we present SimuPhy, a novel task and dataset for evaluating LLMs' understanding, reasoning, and coding-based representation of physical laws. In SimuPhy, a model is given a motion description and tasked with generating code that simulates it. The resulting simulation is executed to produce a video, which is then evaluated by a vision–language model (VLM) using predefined verification questions. SimuPhy contains 7,625 motion descriptions, including a curated 300-example test set with human verification. We evaluate 10 advanced LLMs and find that even the strongest model, DeepSeek-671B, achieves only a 20.6\% pass rate, highlighting the difficulty of the task and the limited physical-law reasoning of current models. Building on this setup, we explore reinforcement learning with verifiable rewards (RLVR), pairing it with supervised fine-tuning (SFT) to improve models' ability to generate physically consistent simulations. Together, SimuPhy and our verifiable-reward training pipeline provide a foundation for bridging language models toward genuine physical understanding.
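The closed-loop evaluation the abstract describes (motion description → generated simulation code → rendered video → VLM verification → pass rate) can be sketched as follows. This is a minimal, hypothetical outline, not the authors' implementation: every function body is a stub, and in a real pipeline `generate_simulation_code`, `render_video`, and `vlm_verify` would call an LLM API, a physics renderer, and a VLM, respectively.

```python
# Hypothetical sketch of a SimuPhy-style closed-loop evaluation.
# All function bodies are stubs standing in for external systems.

def generate_simulation_code(description: str) -> str:
    # Stub: a real system would prompt an LLM with the motion description
    # and return executable simulation code.
    return f"simulate({description!r})"

def render_video(code: str) -> str:
    # Stub: a real system would execute the generated code and render
    # the simulation frames into a video file.
    return f"video_of({code})"

def vlm_verify(video: str, questions: list[str]) -> bool:
    # Stub: a real system would ask a VLM each predefined verification
    # question about the video; a sample passes only if all answers agree
    # with the expected physical behavior. Here we use a trivial
    # placeholder check so the sketch runs.
    return all(q.endswith("?") for q in questions)

def pass_rate(samples: list[tuple[str, list[str]]]) -> float:
    # A sample is (motion description, verification questions).
    passed = 0
    for description, questions in samples:
        code = generate_simulation_code(description)
        video = render_video(code)
        if vlm_verify(video, questions):
            passed += 1
    return passed / len(samples)
```

The pass rate reported in the abstract (e.g., 20.6\% for the strongest model) would be the fraction of the 300 test descriptions whose rendered videos satisfy all of their verification questions.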
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 7495