Keywords: Coding, Symbolic Representation, Multi-modal Reasoning
TL;DR: Benchmarking SVG Code Generation as Symbolic Visual Representation
Abstract: Code has emerged as a precise, executable medium for language-centric tasks, while visual-centric coding remains underexplored. Conventional image representations rely on RGB pixels, which capture visual appearance but offer limited symbolic abstraction. In this work, we advocate for SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation: given an image, a model must produce SVG code that preserves the image's symbolic meaning for downstream reasoning. VCode covers three domains: general commonsense, professional disciplines, and visual-centric perception. To assess symbolic fidelity, we propose CodeVQA, a novel evaluation protocol in which a policy model answers questions over the rendered SVGs; correct answers indicate that the symbolic content was faithfully preserved. We also introduce VCoder, an agentic framework that augments VLMs with test-time revision and visual tool use, yielding substantial improvements over strong baselines.
Submission Number: 20
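The CodeVQA protocol described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `QAItem`, `codevqa_accuracy`, `generate_svg`, `render_svg`, and `answer_question` are all hypothetical, and scoring by case-insensitive exact match is an assumption, since the abstract does not say how answers are judged.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass
class QAItem:
    """One question about an image, with its reference answer (hypothetical type)."""
    question: str
    answer: str


def codevqa_accuracy(
    image: Any,
    qa_items: Sequence[QAItem],
    generate_svg: Callable[[Any], str],          # assumed: the model under test, image -> SVG code
    render_svg: Callable[[str], Any],            # assumed: any SVG rasterizer, SVG code -> image
    answer_question: Callable[[Any, str], str],  # assumed: the policy model, (image, question) -> answer
) -> float:
    """Fraction of questions answered correctly over the rendered SVG.

    The policy model never sees the original image, only the rendering of
    the generated SVG, so a correct answer suggests the SVG kept the
    information the question depends on.
    """
    if not qa_items:
        return 0.0

    svg_code = generate_svg(image)
    rendered = render_svg(svg_code)

    correct = 0
    for item in qa_items:
        predicted = answer_question(rendered, item.question)
        # Assumption: case-insensitive exact match stands in for the real judge.
        if predicted.strip().lower() == item.answer.strip().lower():
            correct += 1
    return correct / len(qa_items)
```

Passing the three stages in as callables keeps the sketch independent of any particular VLM or rendering library; in practice each would wrap a real model call or rasterizer.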