Keywords: Benchmark, Mathematical Reasoning, Vision-Language Models (VLMs), Automated Assessment, Symbolic Reasoning, Error Analysis
TL;DR: This paper introduces CHECK-MAT, a benchmark for assessing whether AI is ready for the high-stakes, socially sensitive task of grading real handwritten math exams.
Abstract: AI-powered assistive technologies hold immense potential to create more accessible and equitable educational environments. A key barrier, however, is the laborious and subjective process of grading complex handwritten assignments, which limits students' access to timely, consistent feedback and overwhelms educators. While Vision-Language Models (VLMs) are a promising solution, their readiness to serve as a reliable assistive tool must be rigorously evaluated to prevent unfair outcomes. To this end, we introduce CHECK-MAT, the first benchmark designed to assess the capabilities of VLMs as an assistive technology for grading handwritten, multi-step mathematical solutions from a real-world national exam. The benchmark comprises 122 scanned solutions from the Russian Unified State Exam (EGE), complete with official expert grades, providing a realistic testbed for this accessibility challenge. We evaluate seven modern VLMs and find that their performance remains well below the level required for reliable use, especially in understanding the logical steps of human reasoning. Our findings chart a path for future research, highlighting the core challenges that must be overcome to develop the next generation of trustworthy, fair, and genuinely assistive AI technologies that can empower both educators and learners. Code is available at https://github.com/Karifannaa/Auto-check-EGE-math.
Submission Number: 16