MVP: Multi-scale Visual Prompt for Visual AutoRegressive Generation

ICLR 2026 Conference Submission 769 Authors

02 Sept 2025 (modified: 23 Dec 2025), CC BY 4.0
Keywords: Prompt Tuning, Visual AutoRegressive, Text-to-Image Generation
Abstract: Prompt tuning, especially perturbation-based prompt tuning, faces obstacles in visual generation. On the one hand, the autoregressive paradigm, which offers the most natural setting for prompt tuning, struggles to model planar concepts: traditional autoregressive methods use raster-scan ordering for image modeling, which disrupts the spatial structure of images. On the other hand, perturbation-based prompts act as learnable perturbations in pixel space, and their effectiveness comes at considerable computational cost, making it difficult to balance performance and efficiency. To address these challenges, we propose Multi-scale Visual Prompt (MVP), a perturbation-based prompt tuning method tailored for visual autoregressive generation, with planar-concept modeling and efficient information propagation. MVP builds on Visual AutoRegressive (VAR) models, whose next-scale prediction captures planar concepts, and introduces prompt tokens in the outermost token frame at each scale for efficient signal control and information propagation. During training, we use increasingly detailed tuning text to facilitate prompt learning. Moreover, MVP extends VAR's capability to text-to-image generation. Extensive experiments validate the effectiveness of MVP. Code is available.
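To make the placement concrete, the following is a minimal, hypothetical sketch (not the authors' released code) of the one idea the abstract states directly: at each scale of a VAR-style multi-scale token map, a frame of prompt tokens is added as the outermost border. The scale sizes, the `wrap_with_prompt_frame` helper, and the `-1` prompt-token marker are illustrative assumptions.

```python
# Hypothetical sketch: surround each scale's k x k token map with a
# one-token-wide border ("outermost token frame") of prompt tokens,
# as MVP's abstract describes. All names and sizes here are assumptions.

def wrap_with_prompt_frame(token_map, prompt_token):
    """Return a (k+2) x (k+2) grid: the original k x k token map
    with a frame of prompt tokens around it."""
    k = len(token_map)
    framed = [[prompt_token] * (k + 2)]          # top edge of the frame
    for row in token_map:
        framed.append([prompt_token] + list(row) + [prompt_token])
    framed.append([prompt_token] * (k + 2))      # bottom edge of the frame
    return framed

# Illustrative next-scale sequence (not the paper's actual scale schedule).
scales = [1, 2, 4]
maps = [[[0] * k for _ in range(k)] for k in scales]
framed_maps = [wrap_with_prompt_frame(m, prompt_token=-1) for m in maps]

# Each scale gains 4k + 4 prompt positions; interior tokens are untouched.
print([len(f) for f in framed_maps])  # prints [3, 4, 6]
```

In a real model the `-1` markers would instead be learnable embedding vectors, so the frame adds only O(k) extra tokens per scale rather than perturbing every pixel, which is consistent with the efficiency claim in the abstract.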
Supplementary Material: pdf
Primary Area: generative models
Submission Number: 769