Keywords: LMM Evaluation, Physical Benchmark, Reasoning Benchmark
Abstract: Existing benchmarks fail to capture a crucial aspect of intelligence: physical reasoning, the integrated ability to combine domain knowledge, symbolic reasoning, and understanding of real-world constraints. To address this gap, we introduce PhyX: the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios. PhyX includes 3K meticulously curated multimodal questions spanning 6 reasoning types across 25 sub-domains and 6 core physics domains: thermodynamics, electromagnetism, mechanics, modern physics, optics, and wave & acoustics. In our comprehensive evaluation, even state-of-the-art models struggle significantly with physical reasoning: GPT-o4-mini, Gemini-2.5-Pro, and GPT-5 achieve only 45.8%, 62.4%, and 65.2% accuracy, respectively, leaving performance gaps exceeding 10% relative to human experts. Our analysis exposes critical limitations in current models: over-reliance on memorized disciplinary knowledge, excessive dependence on mathematical formulations, and surface-level visual pattern matching rather than genuine physical understanding. We provide in-depth analysis through fine-grained statistics, detailed case studies, and multiple evaluation paradigms to thoroughly examine physical reasoning capabilities. To ensure reproducibility, we implement a compatible evaluation protocol based on widely used toolkits such as VLMEvalKit and lmms-eval, enabling one-click evaluation.
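A minimal sketch of the one-click evaluation flow described above, assuming VLMEvalKit is installed locally and invoked through its standard run.py entry point; the dataset and model identifiers ("PhyX", "GPT4o") are placeholders and may differ from the names actually registered in the toolkit:

# Minimal sketch: launch a PhyX evaluation via VLMEvalKit's run.py entry point.
# Assumptions: VLMEvalKit is set up in the current directory; "PhyX" (dataset)
# and "GPT4o" (model) are illustrative identifiers, not confirmed registry names.
import subprocess

def evaluate(dataset: str = "PhyX", model: str = "GPT4o") -> None:
    # run.py takes --data and --model; check=True raises if the run exits non-zero
    subprocess.run(
        ["python", "run.py", "--data", dataset, "--model", model],
        check=True,
    )

if __name__ == "__main__":
    evaluate()

The same questions could alternatively be routed through lmms-eval's task interface; either path keeps scoring consistent and reproducible across models.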
Primary Area: datasets and benchmarks
Submission Number: 13409