Understanding Diagrams with Explicit Intermediate Visual Representation

Published: 17 Sept 2025, Last Modified: 06 Nov 2025, ACS 2025 Oral, CC BY 4.0
Keywords: Diagram Understanding, Spatial and Geometric Reasoning, Information Extraction
Abstract: Diagram understanding remains a challenge for current Vision-Language Models (VLMs), which often fail to capture the fine-grained spatial and relational information essential for deep comprehension. Furthermore, their opaque internal states hinder effective human-machine collaboration. Inspired by human cognition, we propose an alternative approach that prioritizes the creation of explicit, human-readable representations. Our system extends the Hybrid Primal Sketch, which combines computer vision techniques to produce structured, symbolic descriptions of diagrams, and it produces intermediate visual representations that the cognitively inspired CogSketch can encode further. This method generates explicit representations of visual elements and their qualitative spatial relationships, which can then support higher-level visual reasoning. Our approach is highly interpretable, lightweight, and training-free. We demonstrate its advantage on diagram understanding by extracting the underlying structural information from two genres of charts and diagrams.
Paper Track: Technical paper
Submission Number: 42
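
To illustrate what "explicit representations of visual elements and their qualitative spatial relationships" could look like, here is a minimal Python sketch. It is not the authors' pipeline, nor the Hybrid Primal Sketch or CogSketch API: the `Box` type, the relation names (`leftOf`, `contains`, and so on), and the example coordinates are all hypothetical and chosen only for illustration. It assumes visual elements have already been detected as axis-aligned bounding boxes and turns each ordered pair into a human-readable symbolic fact.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in image coordinates (y grows downward)."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float


def relation(a: Box, b: Box) -> str:
    """Return one coarse qualitative relation describing a relative to b.

    The relation vocabulary is an assumption made for this sketch.
    """
    if a.x0 <= b.x0 and a.y0 <= b.y0 and a.x1 >= b.x1 and a.y1 >= b.y1:
        return "contains"
    if b.x0 <= a.x0 and b.y0 <= a.y0 and b.x1 >= a.x1 and b.y1 >= a.y1:
        return "inside"
    if a.x1 <= b.x0:
        return "leftOf"
    if a.x0 >= b.x1:
        return "rightOf"
    if a.y1 <= b.y0:
        return "above"
    if a.y0 >= b.y1:
        return "below"
    return "overlaps"


def describe(boxes):
    """List (relation, subject, object) facts for every ordered pair."""
    return [(relation(a, b), a.name, b.name) for a, b in permutations(boxes, 2)]


# Hypothetical elements of a small bar chart.
boxes = [
    Box("plot_area", 0, 0, 100, 80),
    Box("bar_1", 10, 40, 20, 80),
    Box("bar_2", 30, 20, 40, 80),
]
facts = describe(boxes)
for fact in facts:
    print(fact)
```

Facts in this form are readable by a person and can be passed to a symbolic reasoner without any training step, which is the property the abstract emphasizes. A real system would need a richer relation vocabulary and a way to handle noisy detections.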