Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models

Published: 12 May 2026, Last Modified: 12 May 2026. Accepted by DMLR. License: CC BY-SA 4.0
Abstract: Recent Vision-Language Models (VLMs) have demonstrated impressive multimodal comprehension and reasoning capabilities, yet they often struggle with trivially simple visual tasks. In this work, we focus on the domain of basic 2D Euclidean geometry and systematically categorize the fundamental, indivisible visual perception skills, which we refer to as atomic visual skills. We then introduce the Atomic Visual Skills Dataset (AVSD) for evaluating VLMs on the atomic visual skills. Using AVSD, we benchmark state-of-the-art VLMs and find that they struggle with these tasks, even though the tasks are trivial for adult humans. Our findings highlight the need for purpose-built datasets to train and evaluate VLMs on atomic, rather than composite, visual perception tasks.
Keywords: vision language models, plane geometry, benchmark, mathematical reasoning, visual perception
Changes Since Last Submission: We thank the Action Editor and reviewers for the positive and constructive feedback. We revised the manuscript according to the Action Editor’s minor revision request. The main changes are: strengthened clarification of the “atomic” skill definition, added qualitative failure analysis, added scorer and style-generation ablations, added a textbook-style external validation experiment, softened the domain-specific model claim, expanded related work, added limitations, standardized notation, and added fine-tuning hyperparameters.
1. We revised the manuscript to clarify that our use of “atomic” is operational rather than a formal irreducibility claim. We now acknowledge dependencies such as angle understanding relying on line/point detection.
2. Appendix F.1 and Figure 11 now discuss representative local failures, e.g., tangent/intersecting/disjoint relations and angle classification, and global failures, e.g., reflection and rotation. We also added a zoom-in diagnostic: for a small subset of originally incorrect GPT-4o predictions on local geometries, we re-evaluated with up to $3\times$ magnified images while verifying that the intended geometries were preserved. The limited corrections, together with our preprocessing-output analysis, suggest that these local errors may involve incorrect interpretation of visually available geometric relations, rather than being fully explained by visibility-related artifacts such as preprocessing degradation or insufficient scale.
3. Appendix E adds a scorer ablation to examine dependence on GPT-4o-mini. On 180 GPT-4o responses covering AVSD-h, AVSD-s, AVSD-c, and 18 target skills, the accuracies are 61.1% with GPT-4o-mini scoring, 63.3% with Gemini 3 Flash judgment-only scoring, and 62.8% with Gemini 3 Flash end-to-end scoring. This small gap suggests that the main evaluation results are not highly sensitive to the specific scoring model.
4. Appendix F.3 and Figure 10 add a style-generation ablation using Qwen-Image-Edit instead of ControlNet. We generated 450 styled variants from 90 AVSD-s examples across 18 skills. GPT-4o accuracy drops from 58.9% on the original images to 47.1% on the styled variants, close to the original AVSD-s to AVSD-c drop, supporting that style sensitivity is not specific to ControlNet.
5. Appendix F.4, Table 13, and Figure 13 add Textbook-50, a PGPS9K-based validation using 50 textbook-style diagrams. We created AVSD-style perceptual questions answerable from existing diagram annotations. The results provide an external check that the perceptual challenges studied in AVSD are not limited to our clean synthetic diagrams.
6. We softened the domain-specific model claim. The manuscript now states that geometry-specialized models are not necessarily guaranteed to outperform general-purpose VLMs on AVSD, rather than claiming that domain-specific models are generally not better.
7. We added limitations clarifying that AVSD is a controlled 2D diagnostic benchmark, does not fully cover real textbook/classroom diagrams, does not evaluate 3D geometric understanding, and uses coarse author-defined easy/medium/hard labels rather than calibrated human difficulty estimates.
8. We expanded related work with IR3D-Bench, JarvisIR, and JarvisArt.
9. We standardized notation including LLaVA-NeXT, GPT-4o-mini, $\nu$-geometry, LN-13B, and LN-34B; and added fine-tuning details in Appendix D.
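As a rough reading aid for the ablation figures in the change list, the reported percentages can be turned into percentage-point gaps and approximate response counts. This is a minimal sketch, not the authors' evaluation code: the helper names are hypothetical, and the per-scorer counts of correct responses are inferred by rounding the reported percentages against the stated 180 responses; they are not given in the source.

```python
# Reported figures copied from the change list above.
# ASSUMPTION: counts of correct responses are inferred by rounding; the
# authors report only percentages and the total of 180 responses.
N_RESPONSES = 180

reported_accuracy = {
    "GPT-4o-mini": 61.1,
    "Gemini 3 Flash (judgment-only)": 63.3,
    "Gemini 3 Flash (end-to-end)": 62.8,
}


def implied_correct(accuracy_pct: float, n: int) -> int:
    """Nearest whole number of correct responses consistent with a percentage."""
    return round(accuracy_pct / 100 * n)


def max_gap_points(accuracies: dict) -> float:
    """Largest difference between any two scorers, in percentage points."""
    return round(max(accuracies.values()) - min(accuracies.values()), 1)


implied = {
    name: implied_correct(acc, N_RESPONSES)
    for name, acc in reported_accuracy.items()
}
scorer_gap = max_gap_points(reported_accuracy)

# Style ablation: GPT-4o accuracy on original vs. styled images.
style_drop = round(58.9 - 47.1, 1)

for name, count in implied.items():
    print(f"{name}: about {count}/{N_RESPONSES} correct")
print(f"largest scorer gap: {scorer_gap} points")
print(f"style-ablation drop: {style_drop} points")
```

Under these assumptions, the largest gap between scorers is 2.2 percentage points, which corresponds to roughly 4 of the 180 responses, while the style-ablation drop is 11.8 percentage points.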
Assigned Action Editor: ~Sergio_Escalera1
Submission Number: 135