Keywords: Multimodal Input, Vector Graphics Generation, Unified Multimodal Models, Benchmark and Evaluation
TL;DR: We propose LayerVec, a multimodal framework that generates editable, layered SVG graphics via raster-to-vector deconstruction, together with a new benchmark and evaluation metric.
Abstract: Scalable Vector Graphics (SVGs) are essential for modern design workflows, yet existing methods are confined to single-modality inputs and produce non-editable outputs.
To bridge this gap, we introduce LayerVec, the first framework to synthesize editable, layered SVGs from multimodal prompts.
LayerVec operates on top of powerful Unified Multimodal Models (UMMs) through a dual-stage pipeline: it first generates a raster guidance image, then applies an iterative deconstruction process that segments the image into semantically coherent vector layers.
To facilitate rigorous evaluation, we introduce MUV-Bench, a comprehensive benchmark, and propose Layer-wise CLIP Consistency (LCC), a metric for assessing structural editability.
Experiments show LayerVec significantly outperforms state-of-the-art baselines in producing structurally clean and semantically accurate SVGs.
We further demonstrate its robustness and model-agnostic nature by showing consistent performance gains across different UMM backbones.
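As a rough illustration of how a layer-wise CLIP score could be computed, the sketch below averages CLIP image-text similarity over rendered layers. The checkpoint name, the pairing of each rasterized layer with a text description, and the mean aggregation are assumptions for illustration, not the paper's definition of LCC.

```python
# Hypothetical sketch of a layer-wise CLIP consistency score (not the paper's
# exact LCC formulation). Assumes each SVG layer has been rasterized to a PIL
# image and paired with a short text description of its intended content.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def layerwise_clip_consistency(layer_images, layer_texts):
    """Mean CLIP image-text cosine similarity across (layer, description) pairs."""
    inputs = processor(text=layer_texts, images=layer_images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # image_embeds / text_embeds are the projected (and normalized) CLIP features
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    per_layer = (img * txt).sum(dim=-1)  # similarity of layer i with text i
    return per_layer.mean().item()
```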
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 19642