Abstract: Despite significant advancements, current large multimodal models (LMMs) struggle to bridge the gap between low-level visual perception—focusing on shapes, sizes, and layouts—and high-level language reasoning involving semantics, events, and logic. This limitation becomes evident in tasks requiring precise visual perception, such as comparing geometric properties or solving visual algorithmic reasoning problems. To study this failure mode, we focus on an important visual domain: vector graphics—images composed purely of 2D objects and shapes, which are prevalent in Web, PC, and Mobile environments. Importantly, we consider rasterized vector graphics without assuming access to their underlying vector code. We identify two key research questions: how can we enable precise visual perception, and how can we facilitate high-level reasoning based on such low-level perceptions? To accurately capture low-level visual details, we explore using SVG for the precise encoding of visual scenes. However, SVGs are not readily interpretable by LLMs or LMMs in a zero-shot manner. To address this challenge, we propose the Visually Descriptive Language Model (VDLM) to build a bridge between low-level visual perception and high-level language reasoning. VDLM learns an intermediate symbolic representation called Primal Visual Description (PVD), which translates raw SVGs into a higher-level abstraction comprising primitive attributes. This abstraction allows for direct interpretation by foundation models for zero-shot generalization to different reasoning tasks. Without any human-annotated data, VDLM leads to significant improvements in state-of-the-art LMMs, such as GPT-4o, across various low-level multimodal perception and reasoning tasks on rasterized vector graphics. Additionally, we provide extensive analyses of VDLM’s performance, showing that our framework offers improved interpretability due to its disentangled perception and reasoning processes. As the first attempt to construct a descriptive intermediate representation for low-level visual reasoning, we also conduct an in-depth error analysis, highlighting remaining limitations and suggesting directions for future research.
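To make the pipeline summarized in the abstract concrete, here is a minimal Python sketch of the intended data flow (rasterized image → SVG → PVD → LLM reasoning), written as a composition of three callables. The helper names (`image_to_svg`, `svg_to_pvd`, `llm`), the prompt wording, and the example PVD schema are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal, illustrative sketch of the VDLM pipeline summarized in the abstract
# (rasterized image -> SVG -> PVD -> LLM reasoning). All helper names and the
# PVD schema shown here are hypothetical placeholders, not the paper's code.

import json
from typing import Callable

# A hypothetical PVD record: one primitive object with its low-level attributes.
EXAMPLE_PVD_OBJECT = {
    "type": "circle",
    "center": [48, 60],
    "radius": 12,
    "color": [255, 0, 0],
}

def vdlm_answer(
    image_path: str,
    question: str,
    image_to_svg: Callable[[str], str],       # wrapper around an off-the-shelf vectorizer (the paper uses VTracer)
    svg_to_pvd: Callable[[str], list[dict]],  # the learned SVG-to-PVD perception model
    llm: Callable[[str], str],                # an off-the-shelf LLM/LMM reasoner (e.g., GPT-4o)
) -> str:
    """Compose the three VDLM stages: low-level encoding, symbolic description, reasoning."""
    svg_source = image_to_svg(image_path)     # 1. precise low-level visual encoding as SVG
    pvd_objects = svg_to_pvd(svg_source)      # 2. intermediate Primal Visual Description
    prompt = (
        "The image contains the following primitive objects:\n"
        f"{json.dumps(pvd_objects, indent=2)}\n\n"
        f"Question: {question}"
    )
    return llm(prompt)                        # 3. zero-shot high-level reasoning over the PVD
```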
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: [Camera Ready Changes]
- Textual edits to the abstract and introduction to reflect the meta-review suggestions.
- Added rebuttal content to the Appendix and pointers in the main text.
[Rebuttal Changes]
- Further Analysis of SVG-based Visual Encoding Quality
  - VTracer Encoding Quality: Added verification of near-perfect SVG reconstructions via VTracer.
  - Impact of VTracer Error on End-task Performance: Introduced an analysis of end-task robustness to vectorization errors, grouping instances by VTracer quality and plotting accuracies.
- Additional Experiments on PVD Perception
  - Different LLM Choices for the SVG-to-PVD Model: Compared perception and end-task scores when using Qwen-2.5-7B vs. Mistral-7B backbones.
  - Ablation with a PNG-to-PVD Model: Added an ablation showing PNG-to-PVD’s higher loss and poorer perception performance vs. SVG-to-PVD.
- Additional Experiments Using Open-Source LMMs as Reasoners
  - Demonstrated VDLM’s effectiveness with Qwen-2.5-VL-72B, showing a 2% overall gain across 9 tasks.
- Additional Qualitative Examples of PVD Parsing Novel Concepts
  - Showcased PVD’s ability to compose novel shapes (e.g., star, cross, circle segment).
- Further Exploration of Prompt Engineering
  - Added a verification-step prompt (“double-check whether the objects in the PVD perception match the image”) and reported its impact on ShapeWorld tasks.
Supplementary Material: zip
Submission Number: 4467