Keywords: Multimodal Large Language Models, Visual Reasoning, Physics Benchmark, Model Evaluation
Abstract: Multimodal Large Language Models (MLLMs) show strong performance in visual reasoning, yet existing benchmarks for physics-related scenarios have significant limitations. To fill this gap, this paper proposes MV-Physics, a novel multi-dimensional benchmark tailored to physics visual scenarios, consisting of 8,011 junior high school physics questions from China's basic education system, covering 5 disciplinary subfields and 3 question types. A hierarchical, multi-dimensional evaluation framework, spanning single-image to multi-image reasoning, is constructed to systematically assess MLLMs' physics visual reasoning abilities. Experimental results show that Gemini 2.5 Pro achieves 88.34% accuracy on single-image tasks, while doubao-seed-1-6-vision-250815 reaches 89.96% on multi-image tasks. However, most models perform poorly on multiple-select questions, with accuracy below the 60% threshold. Moreover, analysis of reasoning efficiency on multi-image tasks indicates that models still need substantial improvement in balancing accuracy and inference latency. This study provides a critical evaluation tool for advancing MLLMs in physics visual domains and serves as a "competence calibration metric" for intelligent teaching tools in basic education.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation, Multimodality and Language Grounding to Vision, Robotics and Beyond, Question Answering, Human-Centered NLP
Contribution Types: Data resources, Data analysis
Languages Studied: Chinese
Submission Number: 1113