PhyX: Does Your Model Have the "Wits" for Physical Reasoning?

ICLR 2026 Conference Submission 13409 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LMM Evaluation, Physical Benchmark, Reasoning Benchmark
Abstract: Existing benchmarks fail to capture a crucial aspect of intelligence: physical reasoning, the integrated ability to combine domain knowledge, symbolic reasoning, and an understanding of real-world constraints. To address this gap, we introduce PhyX: the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios. PhyX includes 3K meticulously curated multimodal questions spanning 6 reasoning types across 25 sub-domains and 6 core physics domains: thermodynamics, electromagnetism, mechanics, modern physics, optics, and wave & acoustics. In our comprehensive evaluation, even state-of-the-art models struggle significantly with physical reasoning: GPT-o4-mini, Gemini-2.5-Pro, and GPT-5 achieve only 45.8%, 62.4%, and 65.2% accuracy, respectively, leaving gaps of more than 10% relative to human experts. Our analysis exposes critical limitations in current models: over-reliance on memorized disciplinary knowledge, excessive dependence on mathematical formulations, and surface-level visual pattern matching rather than genuine physical understanding. We provide in-depth analysis through fine-grained statistics, detailed case studies, and multiple evaluation paradigms to thoroughly examine physical reasoning capabilities. To ensure reproducibility, we implement an evaluation protocol compatible with widely-used toolkits such as VLMEvalKit and lmms-eval, enabling one-click evaluation.
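
As context for the reproducibility claim, the sketch below shows one way such a one-click evaluation could be launched through the toolkits named in the abstract. It assumes both toolkits are installed and that PhyX is registered under the identifiers `PhyX` (VLMEvalKit) and `phyx` (lmms-eval); those identifiers and the example model names are illustrative assumptions, not details taken from this submission.

```python
# Minimal sketch: launching a PhyX evaluation via the two toolkits mentioned
# in the abstract. The dataset/task names and model names below are assumed
# placeholders; substitute whatever identifiers the installed toolkits expose.
import subprocess


def run_vlmevalkit(model: str, dataset: str = "PhyX") -> None:
    """Call VLMEvalKit's standard CLI: python run.py --data <DATASET> --model <MODEL>."""
    subprocess.run(
        ["python", "run.py", "--data", dataset, "--model", model],
        check=True,
    )


def run_lmms_eval(model: str, tasks: str = "phyx") -> None:
    """Call lmms-eval's standard CLI: python -m lmms_eval --model <MODEL> --tasks <TASKS>."""
    subprocess.run(
        ["python", "-m", "lmms_eval", "--model", model, "--tasks", tasks],
        check=True,
    )


if __name__ == "__main__":
    # Hypothetical model identifiers, shown only to illustrate the call pattern.
    run_vlmevalkit(model="GPT4o")
    run_lmms_eval(model="llava")
```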
Primary Area: datasets and benchmarks
Submission Number: 13409