Keywords: Coding, Symbolic Representation, Multi-modal Reasoning
TL;DR: SVG Code as Symbolic Visual Representation
Abstract: Code has emerged as a precise, executable medium for reasoning and action in the agent era. Yet progress has largely focused on language-centric tasks, such as program synthesis and debugging, leaving visual-centric coding underexplored. Conventional image representations rely on dense RGB pixels that capture appearance but provide limited symbolic abstraction. Inspired by how humans reason over sketches, we advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation: given an image, a model must produce SVG that preserves symbolic meaning for downstream reasoning. VCode covers three challenging domains: general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel evaluation protocol in which a policy model answers questions over the rendered SVG; correct answers indicate faithful symbolic preservation. Empirically, frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap between language-centric and visual-centric coding. To close this gap, we introduce VCoder, an agentic framework that augments VLMs along two axes: (i) Thinking with Revision, which iteratively analyzes discrepancies and refines SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply structured cues (objects, shapes, text) beyond intrinsic model capacity. Across benchmarks, frontier VLMs with strong reasoning score well overall yet remain limited on professional knowledge and 3D reasoning; VCoder delivers an 8.7-point overall gain over the top-performing Claude-4-Opus. Human studies further show that although VLMs score higher on raw images, humans are more robust on rendered SVGs, underscoring symbolic visual coding as a promising paradigm for human-like multimodal intelligence.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 5638