Code Url: https://github.com/DRJCompSciWiz/PhysMent---An-Interactive-Approach-For-LLM-Reasoning-In-Physics-Problems/tree/main
Keywords: Physics reasoning, Large Language Models, Benchmarks, MuJoCo, Agentic AI
TL;DR: A benchmark that evaluates LLMs on physics problems through hands-on experimentation with a simulator, mirroring real scientific practice.
Abstract: Large language models (LLMs) perform strongly on static science benchmarks, yet their ability to reason about the physical world through _active experimentation_ remains poorly understood. We introduce **PhysMent**, a benchmark that evaluates LLM physical reasoning via iterative, tool-mediated interaction with a MuJoCo physics simulator. Unlike static benchmarks that supply all quantities upfront, PhysMent requires models to _discover_ information by applying forces, querying object states, advancing time, and modifying scene geometry before answering. The benchmark comprises 105 classical-mechanics scenes, organized across four difficulty regimes (Easy/Hard $\times$ Single/Multi), three scene modalities (standard, object creation, hidden objects), and a scene-manipulation category, and is evaluated with a six-dimensional scoring framework. Results show that current models perform reasonably well on qualitative single-concept tasks (up to 80\% accuracy) but degrade substantially on quantitative tasks that demand precise, multi-step experimental procedures: most models fall below 30\% on the hardest single-concept category, where the bottleneck is procedural (adaptive multi-step tool use) rather than conceptual load. Across the seven models evaluated, accuracy ranges from 25\% to 67\%, with failures stemming from premature answer submission, inefficient exploration, and inconsistent grounding in simulator feedback rather than from conceptual gaps.
Submission Number: 189
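Illustrative sketch (not from the submission): the abstract describes models that must discover quantities by applying forces, advancing time, and querying object states. The toy example below mimics that loop on a one-dimensional frictionless block with a hidden mass. It is a pure-Python stand-in, not MuJoCo and not PhysMent's tool interface; `ToyScene`, `apply_force`, `step`, `query_state`, and `estimate_mass` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToyScene:
    """Hypothetical 1-D scene: a block on a frictionless surface.

    A stand-in for a real simulator; the agent is assumed never to
    read `hidden_mass` directly.
    """
    hidden_mass: float          # kg
    position: float = 0.0       # m
    velocity: float = 0.0       # m/s
    applied_force: float = 0.0  # N
    time: float = 0.0           # s

    def apply_force(self, force: float) -> None:
        """Tool call: set the constant force acting on the block."""
        self.applied_force = force

    def step(self, duration: float, dt: float = 0.001) -> None:
        """Tool call: advance time using semi-implicit Euler integration."""
        for _ in range(round(duration / dt)):
            self.velocity += (self.applied_force / self.hidden_mass) * dt
            self.position += self.velocity * dt
            self.time += dt

    def query_state(self) -> dict:
        """Tool call: return only the observable quantities."""
        return {
            "time": self.time,
            "position": self.position,
            "velocity": self.velocity,
        }


def estimate_mass(scene: ToyScene, force: float = 10.0, duration: float = 1.0) -> float:
    """Infer the mass from one push, using the impulse relation F * dt = m * dv."""
    before = scene.query_state()
    scene.apply_force(force)
    scene.step(duration)
    after = scene.query_state()
    scene.apply_force(0.0)  # release the block before answering

    delta_v = after["velocity"] - before["velocity"]
    delta_t = after["time"] - before["time"]
    return force * delta_t / delta_v


if __name__ == "__main__":
    scene = ToyScene(hidden_mass=2.5)
    print(f"estimated mass: {estimate_mass(scene):.3f} kg")
```

The answer is obtained only from quantities the agent measured itself, which is the distinction the abstract draws against static benchmarks that supply all quantities upfront.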