PhysMent: An Interactive Approach For LLM Reasoning In Physics Problems

Joseph Chan; Utkarsh Jha; Xiyin Yang; Abhinav Jarajapu; Anik Sahai; Eddie Hu; Robin Jeshua Deepak; Stefano Saravalle; Aditya Shah

Published: 17 Jun 2026, Last Modified: 26 Jun 2026

Venue: ICML 2026 AI4Math Workshop Poster

License: CC BY 4.0

Code Url: https://github.com/DRJCompSciWiz/PhysMent---An-Interactive-Approach-For-LLM-Reasoning-In-Physics-Problems/tree/main

Keywords: Physics reasoning, Large Language Models, Benchmarks, MuJoCo, Agentic AI

TL;DR: A benchmark that evaluates LLMs on physics problems through hands-on experimentation with a simulator, mirroring real scientific practice.

Abstract: Large language models (LLMs) perform strongly on static science benchmarks, yet their ability to reason about the physical world through _active experimentation_ remains poorly understood. We introduce **PhysMent**, a benchmark that evaluates LLM physical reasoning via iterative, tool-mediated interaction with a MuJoCo physics simulator. Unlike static benchmarks that supply all quantities upfront, PhysMent requires models to _discover_ information by applying forces, querying object states, advancing time, and modifying scene geometry before answering. The benchmark comprises 105 scenes of classical mechanics, organized across four difficulty regimes (Easy/Hard $\times$ Single/Multi), three scene modalities (standard, object creation, hidden objects), and a scene-manipulation category, evaluated with a six-dimensional scoring framework. Results show that current models perform reasonably well on qualitative single-concept tasks (up to 80\% accuracy) but degrade substantially on quantitative tasks that demand precise, multi-step experimental procedures: most models fall below 30\% on the hardest single-concept category, where the bottleneck is procedural (adaptive multi-step tool use) rather than conceptual load. Across the seven models evaluated, accuracy ranges from 25\% to 67\%, and failures stem from premature answer submission, inefficient exploration, and inconsistent grounding in simulator feedback rather than from conceptual gaps.
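To make the "discover before answering" idea concrete, here is a minimal sketch of such an interaction loop. It is an illustration only: `ToyScene`, `apply_force`, `step`, `get_state`, and `estimate_mass` are hypothetical names, the scene is a frictionless 1D toy rather than a MuJoCo model, and none of this is taken from the PhysMent codebase. The agent is not told the object's mass, so it applies a known force, advances time, queries the state, and infers the mass from $F = ma$.

```python
from dataclasses import dataclass


@dataclass
class ToyScene:
    """Hypothetical 1D stand-in for a simulator scene (not MuJoCo, not PhysMent's API)."""
    hidden_mass: float           # unknown to the agent; must be inferred by experiment
    position: float = 0.0
    velocity: float = 0.0
    applied_force: float = 0.0
    dt: float = 0.001            # integration time step in seconds

    def apply_force(self, force: float) -> None:
        """Tool call: set a constant force on the object."""
        self.applied_force = force

    def step(self, duration: float) -> None:
        """Tool call: advance time using semi-implicit Euler integration."""
        for _ in range(round(duration / self.dt)):
            self.velocity += (self.applied_force / self.hidden_mass) * self.dt
            self.position += self.velocity * self.dt

    def get_state(self) -> dict:
        """Tool call: query the observable state (the mass is not exposed)."""
        return {"position": self.position, "velocity": self.velocity}


def estimate_mass(scene: ToyScene, force: float = 2.0, duration: float = 1.0) -> float:
    """Infer mass from an experiment: push with a known force, measure acceleration."""
    v_before = scene.get_state()["velocity"]
    scene.apply_force(force)
    scene.step(duration)
    v_after = scene.get_state()["velocity"]
    acceleration = (v_after - v_before) / duration
    return force / acceleration  # Newton's second law, F = m * a


if __name__ == "__main__":
    scene = ToyScene(hidden_mass=4.0)
    print(f"estimated mass: {estimate_mass(scene):.3f} kg")
```

A static benchmark would hand the model the mass directly; in this sketch the answer is only available after a sequence of tool calls, which is the kind of procedural step the abstract identifies as the bottleneck.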

Submission Number: 189
