Keywords: Interpretability and Analysis of Models for NLP, Language Models, Mathematical, Symbolic, Neurosymbolic, and Logical Reasoning
Abstract: Current evaluations of multimodal large language models in physics rely predominantly on final-answer accuracy, implicitly equating correct answers with correct reasoning. This assumption overlooks the structured nature of physics problem solving, which requires accurate perception of visual scenes, correct interpretation of problem descriptions, and principled application of physical concepts. As a result, existing benchmarks and evaluation protocols fail to expose critical reasoning failures, particularly when models produce fluent explanations or arrive at correct answers for the wrong reasons. We address this gap by introducing a diagnostic evaluation framework for multimodal physics reasoning that goes beyond outcome-based metrics. We propose a fine-grained error taxonomy that disentangles perception, explanation, concept-selection, and value-interpretation errors, and apply it consistently across different reasoning settings and input modalities. Rather than ranking models, our analysis focuses on revealing how and why errors arise. By making hidden failure modes explicit, our evaluation provides more meaningful insight into multimodal model behavior and establishes a foundation for more rigorous and interpretable assessment of physics reasoning systems.
Paper Type: Long
Research Area: Language Models
Research Area Keywords: Interpretability and Analysis of Models for NLP, Language Models
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 7690