Primitive Vision: Improving Diagram Understanding in MLLMs

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We introduce PRIMITIVE, a geometry-grounded model that boosts fine-grained visual understanding in Multimodal LLMs, significantly improving performance on mathematical diagram reasoning by addressing geometric perception challenges.
Abstract: Mathematical diagrams have a distinctive structure. Standard feature transforms designed for natural images (e.g., CLIP) fail to process them effectively, limiting their utility in multimodal large language models (MLLMs). Current efforts to improve MLLMs have primarily focused on scaling mathematical visual instruction datasets and strengthening LLM backbones, yet fine‐grained visual recognition errors remain unaddressed. Our systematic evaluation of the visual grounding capabilities of state‐of‐the‐art MLLMs highlights that fine‐grained visual understanding remains a crucial bottleneck in visual mathematical reasoning (GPT-4o exhibits a 70% grounding error rate, and correcting these errors improves reasoning accuracy by 12%). We thus propose a novel approach featuring a geometrically‐grounded vision encoder and a feature router that dynamically selects between hierarchical visual feature maps. Our model accurately recognizes visual primitives and generates precise visual prompts aligned with the language model's reasoning needs. In experiments, PRIMITIVE-Qwen2.5-7B outperforms other 7B models by 12% on MathVerse and is on par with GPT-4V on MathVista. Our findings highlight the need for better fine‐grained visual integration in MLLMs. Code is available at github.com/AI4Math-ShanZhang/SVE-Math.
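The abstract describes a feature router that dynamically selects between hierarchical visual feature maps before they reach the language model. The sketch below is a minimal, hypothetical illustration of one way such a router could be structured in PyTorch; it is not the released SVE-Math implementation, and the class name, gating design, and tensor shapes are assumptions made purely for illustration.

```python
# Hypothetical sketch (not the released SVE-Math code): a router that scores
# hierarchical feature maps from a vision encoder and weights each level
# before forwarding visual tokens to the LLM.
import torch
import torch.nn as nn


class FeatureRouter(nn.Module):
    def __init__(self, dim: int, num_levels: int):
        super().__init__()
        self.num_levels = num_levels
        # One gating score per feature-map level, computed from pooled features.
        self.gate = nn.Linear(dim, 1)

    def forward(self, feature_maps: list[torch.Tensor]) -> torch.Tensor:
        # feature_maps: list of [batch, tokens_i, dim] tensors, one per level
        # (e.g., coarse-to-fine maps from a geometry-grounded encoder).
        assert len(feature_maps) == self.num_levels
        pooled = torch.stack([f.mean(dim=1) for f in feature_maps], dim=1)  # [B, L, dim]
        weights = torch.softmax(self.gate(pooled).squeeze(-1), dim=-1)      # [B, L]
        # Scale each level's tokens by its routing weight and concatenate the
        # token sequences; the result is the visual prompt passed to the LLM.
        routed = torch.cat(
            [w.view(-1, 1, 1) * f for w, f in zip(weights.unbind(dim=1), feature_maps)],
            dim=1,
        )
        return routed
```

A soft (softmax-weighted) mixture is shown here only because it keeps the sketch differentiable end to end; a hard, per-query selection among levels would be an equally plausible reading of "dynamically selects."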
Lay Summary: Mathematical diagrams—like those showing geometric shapes or formula illustrations—are common in textbooks and exams, but most AI models struggle to “see” them correctly, leading to mistakes when solving the problems they illustrate. To address this, we built a specialized computer “eye” (a vision module) that is trained to recognize the individual lines, angles, and shapes in a math diagram rather than treating it like an ordinary photograph. This module works together with a language‐understanding system, sending it clear, fine‐grained visual cues so that the AI can reason about the math content properly. In tests, our approach helped the AI reduce errors by more than 10 percent when answering diagram‐based questions. By teaching AI to interpret diagrams more accurately, we make it easier to build digital tutors, automated grading systems, and other educational tools that rely on understanding complex visual math content.
Link To Code: https://github.com/AI4Math-ShanZhang/SVE-Math
Primary Area: Deep Learning->Foundation Models
Keywords: Multimodal Large Language Models, Visual Mathematical Understanding, Visual Perception, Geometric Primitive Grounding
Submission Number: 2063