Keywords: Interpretability and Analysis of Models for NLP, Language Models, Mathematical, Symbolic, Neurosymbolic, and Logical Reasoning
Abstract: Current evaluations of multimodal large language models in physics rely predominantly on final-answer accuracy, implicitly equating correct answers with correct reasoning. This assumption overlooks the structured nature of physics problem solving, which requires accurate perception of visual scenes, correct interpretation of problem descriptions, and principled application of physical concepts. As a result, existing benchmarks and evaluation protocols fail to expose critical reasoning failures, particularly when models produce fluent explanations or arrive at correct answers for the wrong reasons. We address this gap by introducing a diagnostic evaluation framework for multimodal physics reasoning that goes beyond outcome-based metrics. We propose a fine-grained error taxonomy that disentangles perception, explanation, concept-selection, and value-interpretation errors, and apply it consistently across different reasoning settings and input modalities. Rather than ranking models, our analysis focuses on revealing how and why errors arise. By making hidden failure modes explicit, our evaluation provides more meaningful insight into multimodal model behavior and establishes a foundation for more rigorous and interpretable assessment of physics reasoning systems.
Paper Type: Long
Research Area: Language Models
Research Area Keywords: Interpretability and Analysis of Models for NLP, Language Models
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 7690