Keywords: multimodal reasoning, large language models, multimodal benchmark, reasoning benchmark
TL;DR: A new multimodal benchmark of standardized exam questions that even the best VLMs struggle to solve
Abstract: Large Multimodal Models (LMMs) have made remarkable progress in bridging language and vision, yet their performance on visually grounded scientific and exam-style reasoning tasks remains far below human-level ability. To systematically probe these limitations, we introduce YKSBench, a multimodal benchmark of 2,047 university entrance exam questions spanning mathematics, geometry, physics, chemistry, biology, and geography. Unlike prior benchmarks that focus narrowly on mathematics or synthetic tasks, YKSBench captures diverse question formats where visual information is indispensable. Despite the apparent simplicity of many of these problems to humans, state-of-the-art LMMs show striking deficiencies: the best-performing proprietary model, GPT-5, reaches only 39.34\% accuracy, while the strongest open-source model, Gemma 3 27B, achieves 26.82\% accuracy. We further provide qualitative analyses and novel scientific figures illustrating failure modes where models misinterpret diagrams, schematics, or spatial layouts. Positioned as a complementary resource to existing benchmarks such as MathVista, MathVision, and MMStar, YKSBench broadens the evaluation landscape and emphasizes the urgent need for improved visual reasoning in LMMs. The dataset is open-sourced at \href{https://huggingface.co/datasets/metu-yks/yksbench}{metu-yks/YKSBench}.
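For readers who want to inspect the benchmark directly, below is a minimal sketch of loading it from the Hugging Face Hub with the standard `datasets` library. The dataset ID comes from the link above; the split and field names printed here depend on how the repository is organized and are not specified in the abstract, so they are only illustrative.

```python
# Minimal sketch: loading YKSBench from the Hugging Face Hub.
# Assumes the standard `datasets` library; the split and field names
# shown by the prints are whatever the repository defines -- they are
# not documented in the abstract itself.
from datasets import load_dataset

ds = load_dataset("metu-yks/yksbench")  # dataset ID from the abstract
print(ds)  # inspect the available splits and their features

# Peek at a few examples from the first available split.
split_name = next(iter(ds.keys()))
for example in ds[split_name].select(range(3)):
    print({k: type(v).__name__ for k, v in example.items()})
```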
Submission Number: 233