Keywords: Multimodal Input, Vector Graphics Generation, Unified Multimodal Models, Benchmark and Evaluation
TL;DR: We propose LayerVec, a multimodal framework that generates editable, layered SVG graphics via raster-to-vector deconstruction, together with a new benchmark and evaluation metric.
Abstract: Scalable Vector Graphics (SVGs) are essential for modern design workflows, yet existing methods are confined to single-modality inputs and produce non-editable outputs.
To bridge this gap, we introduce LayerVec, the first framework to synthesize editable, layered SVGs from multimodal prompts.
LayerVec operates on top of powerful Unified Multimodal Models (UMMs) through a dual-stage pipeline: it first generates a raster guidance image, then applies an iterative deconstruction process that segments the image into semantically coherent vector layers.
To facilitate rigorous evaluation, we introduce MUV-Bench, a comprehensive benchmark, and propose Layer-wise CLIP Consistency (LCC), a metric for assessing structural editability.
Experiments show LayerVec significantly outperforms state-of-the-art baselines in producing structurally clean and semantically accurate SVGs.
We further demonstrate its robustness and model-agnostic nature by showing consistent performance gains across different UMM backbones.
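As a rough illustration of how a layer-wise CLIP score could be computed, the sketch below averages CLIP image-text similarity over rendered layers. The checkpoint name, the pairing of each rasterized layer with a text description, and the mean aggregation are assumptions for illustration, not the paper's definition of LCC.

```python
# Hypothetical sketch of a layer-wise CLIP consistency score (not the paper's
# exact LCC formulation). Assumes each SVG layer has been rasterized to a PIL
# image and paired with a short text description of its intended content.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def layerwise_clip_consistency(layer_images, layer_texts):
    """Mean CLIP image-text cosine similarity across (layer, description) pairs."""
    inputs = processor(text=layer_texts, images=layer_images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # image_embeds / text_embeds are the projected (and normalized) CLIP features
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    per_layer = (img * txt).sum(dim=-1)  # similarity of layer i with text i
    return per_layer.mean().item()
```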
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 19642