Beyond Accuracy: Evaluating Multimodal Mathematical and Scientific Reasoning Through Error Analysis and Self-Correction
Keywords: Vision-Language Models, Self-Correction, Multimodal Reasoning, Cross-lingual Evaluation
TL;DR: Open-source vision-language models fail dramatically at advanced mathematical reasoning compared to frontier models, and critically lack self-correction abilities despite being able to detect their own errors.
Abstract: While contemporary large vision-language models achieve impressive performance on standard benchmarks, their reasoning depth remains poorly understood. We evaluate multimodal mathematical and scientific reasoning through comprehensive error analysis and self-correction assessment using challenging bilingual (English and Hindi) problems from the Joint Entrance Examination (JEE) Advanced. Our evaluation of eleven models reveals that while frontier models achieve 76.8-83.9% accuracy, open-source alternatives reach only 10.9-50.9%, a performance gap not observed on existing benchmarks such as MMMU. We also observe instruction-following failures, and even state-of-the-art models frequently respond in English despite prompts in other languages. Most critically, our self-correction pipeline shows that models correct fewer than 10% of their erroneous responses, despite detecting 30-79% of errors and despite pass@k accuracy improvements of 31-55%. These findings indicate that the cognitive demands of sequential self-reflection exceed current model capabilities. We publicly release our codebase and data: https://anonymous.4open.science/r/mmJEE-Eval-D14F
Submission Number: 69