Beyond Accuracy: Evaluating Multimodal Mathematical and Scientific Reasoning Through Error Analysis and Self-Correction
Keywords: Vision-Language Models, Self-Correction, Multimodal Reasoning, Cross-lingual Evaluation
TL;DR: Open-source vision-language models fail dramatically at advanced mathematical reasoning compared to frontier models, and critically lack self-correction abilities despite being able to detect their own errors.
Abstract: While contemporary large vision-language models achieve impressive performance on standard benchmarks, their reasoning depth remains poorly understood. We evaluate multimodal mathematical and scientific reasoning through comprehensive error analysis and self-correction assessment using challenging bilingual (English and Hindi) problems from the Joint Entrance Examination (JEE) Advanced. Our evaluation of eleven models reveals that while frontier models achieve 76.8-83.9% accuracy, open-source alternatives reach only 10.9-50.9%, a performance gap not observed on existing benchmarks such as MMMU. We also observe instruction-following failures, and even state-of-the-art models frequently respond in English despite prompts in other languages. Most critically, our self-correction pipeline shows that models correct fewer than 10% of their erroneous responses, despite detecting 30-79% of errors and despite pass@k accuracy improvements of 31-55%. These findings indicate that the cognitive demands of sequential self-reflection exceed current model capabilities. We publicly release our codebase and data: https://anonymous.4open.science/r/mmJEE-Eval-D14F
Submission Number: 69