Keywords: Scalable Vector Graphics Generation, Multimodal Guidance, Interactive Robot Drawing, Multimodal Large Language Models
TL;DR: RoboSVG is a unified multimodal framework, trained on the large-scale RoboDraw dataset, that generates and refines interactive SVGs from text, images, and partial inputs, achieving state-of-the-art performance in versatile SVG generation.
Abstract: Scalable Vector Graphics (SVGs) are fundamental to digital design and robot control, encoding not only visual structure but also motion paths in interactive drawings. In this work, we introduce RoboSVG, a unified multimodal framework for generating interactive SVGs guided by textual, visual, and numerical signals. Given an input query, the RoboSVG model first produces multimodal guidance, then synthesizes candidate SVGs through dedicated generation modules, and finally refines them under numerical guidance to yield high-quality outputs. To support this framework, we construct RoboDraw, a large-scale dataset of one million examples, each pairing an SVG generation condition (e.g., text, image, or partial SVG) with its corresponding ground-truth SVG code. The RoboDraw dataset enables systematic study of four tasks: basic generation (Text-to-SVG, Image-to-SVG) and interactive generation (PartialSVG-to-SVG, PartialImage-to-SVG). Extensive experiments demonstrate that RoboSVG achieves superior query compliance and visual fidelity across tasks, establishing a new state of the art in versatile SVG generation. The dataset and source code will be made publicly available.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3565