Rendering-Aware Reinforcement Learning for Vector Graphics Generation

Juan A. Rodriguez; Haotian Zhang; Abhay Puri; Rishav Pramanik; Aarash Feizi; Pascal Wichmann; Arnab Kumar Mondal; Mohammad Reza Samsami; Rabiul Awal; Perouz Taslakian; Spandana Gella; Sai Rajeswar; David Vazquez; Christopher Pal; Marco Pedersoli

Published: 18 Sept 2025, Last Modified: 10 Dec 2025, NeurIPS 2025 poster, CC BY 4.0

Keywords: SVG, Scalable Vector Graphics, Multimodal, VLM, Reinforcement Learning

TL;DR: We refine SVG generation using online reinforcement learning with image reconstruction, semantic, and code-level rewards, boosting accuracy, efficiency, and interpretability.

Abstract: Scalable Vector Graphics (SVG) offer a powerful format for representing visual designs as interpretable code. Recent advances in vision-language models (VLMs) have enabled high-quality SVG generation by framing the problem as a code generation task and leveraging large-scale pretraining. VLMs are particularly suitable for this task as they capture both global semantics and fine-grained visual patterns, while transferring knowledge across vision, natural language, and code domains. However, existing VLM approaches often struggle to produce faithful and efficient SVGs because they never observe the rendered images during training. Although differentiable rendering for autoregressive SVG code generation remains unavailable, rendered outputs can still be compared to original inputs, enabling evaluative feedback suitable for reinforcement learning (RL). We introduce RLRF (Reinforcement Learning from Rendering Feedback), an RL method that enhances SVG generation in autoregressive VLMs by leveraging feedback from rendered SVG outputs. Given an input image, the model generates SVG roll-outs that are rendered and compared to the original image to compute a reward. This visual fidelity feedback guides the model toward producing more accurate, efficient, and semantically coherent SVGs. RLRF significantly outperforms supervised fine-tuning, addressing common failure modes and enabling precise, high-quality SVG generation with strong structural understanding and generalization.
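
The render-and-compare step described in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the toy rasterizer (`render_rects`, which handles only black axis-aligned `<rect>` elements with integer coordinates), the reward shape (one minus pixel mean squared error, minus a small code-length penalty), and every parameter name and default (`length_weight`, `max_chars`, `invalid_reward`) are assumptions chosen for illustration. The sketch only shows how several SVG roll-outs for one target image could each be rendered and scored.

```python
import xml.etree.ElementTree as ET

SIZE = 8  # toy canvas resolution (assumption, not from the paper)


def render_rects(svg_code, size=SIZE):
    """Toy rasterizer (hypothetical): draws each <rect> as black on a white
    canvas. Returns None when the SVG code cannot be parsed."""
    try:
        root = ET.fromstring(svg_code)
        canvas = [[1.0] * size for _ in range(size)]  # 1.0 = white
        for el in root.iter():
            if el.tag.split("}")[-1] != "rect":  # ignore XML namespace prefix
                continue
            x, y = int(el.get("x", "0")), int(el.get("y", "0"))
            w, h = int(el.get("width", "0")), int(el.get("height", "0"))
            for row in range(max(y, 0), min(y + h, size)):
                for col in range(max(x, 0), min(x + w, size)):
                    canvas[row][col] = 0.0  # 0.0 = black
        return canvas
    except (ET.ParseError, ValueError):
        return None


def reconstruction_reward(target, rendered):
    """Assumed image reward: 1 - MSE between two equally sized canvases."""
    n = len(target) * len(target[0])
    mse = sum(
        (t - r) ** 2
        for t_row, r_row in zip(target, rendered)
        for t, r in zip(t_row, r_row)
    ) / n
    return 1.0 - mse


def rollout_reward(svg_code, target, length_weight=0.1, max_chars=200,
                   invalid_reward=-1.0):
    """Score one roll-out: render it, compare it to the target image, and
    subtract a small penalty for long code (all weights are assumptions)."""
    rendered = render_rects(svg_code, size=len(target))
    if rendered is None:
        return invalid_reward  # unrenderable code gets a fixed low reward
    length_penalty = min(1.0, len(svg_code) / max_chars)
    return reconstruction_reward(target, rendered) - length_weight * length_penalty


# One target image and three hypothetical roll-outs sampled for it.
target = render_rects('<svg><rect x="2" y="2" width="4" height="4"/></svg>')
rollouts = [
    '<svg><rect x="2" y="2" width="4" height="4"/></svg>',  # exact match
    '<svg><rect x="3" y="3" width="4" height="4"/></svg>',  # shifted square
    '<svg><rect x="2"',                                     # truncated code
]
rewards = [rollout_reward(code, target) for code in rollouts]
```

In an RL setup, scalar rewards like these would feed a policy-gradient update over the sampled roll-outs; the update rule itself is outside the scope of this sketch.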

Supplementary Material: zip

Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)

Submission Number: 6709
