MVP: Multi-scale Visual Prompt for Visual AutoRegressive Generation

ICLR 2026 Conference Submission 769 Authors

02 Sept 2025 (modified: 23 Dec 2025), CC BY 4.0
Keywords: Prompt Tuning, Visual AutoRegressive, Text-to-Image Generation
Abstract: Prompt tuning, especially perturbation-based prompt tuning, faces obstacles in visual generation. On the one hand, the autoregressive paradigm, which offers the most natural setting for prompt tuning, struggles to model planar concepts: traditional autoregressive methods use raster-scan ordering for image modeling, which disrupts the spatial structure of images. On the other hand, perturbation-based prompts act as learnable perturbations in pixel space, and their effectiveness comes at considerable computational cost, making it difficult to balance performance and efficiency. To address these challenges, we propose Multi-scale Visual Prompt (MVP), a perturbation-based prompt tuning method tailored for visual autoregressive generation, with planar-concept modeling and efficient information propagation. MVP builds on Visual AutoRegressive (VAR) models, whose next-scale prediction captures planar concepts, and introduces prompt tokens in the outermost token frame at each scale for efficient signal control and information propagation. During training, we use increasingly detailed tuning text to facilitate prompt learning. Moreover, MVP extends VAR's capability to text-to-image generation. Extensive experiments validate the effectiveness of MVP. Code is available.
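To make the placement concrete, the following is a minimal, hypothetical sketch (not the authors' released code) of the one idea the abstract states directly: at each scale of a VAR-style multi-scale token map, a frame of prompt tokens is added as the outermost border. The scale sizes, the `wrap_with_prompt_frame` helper, and the `-1` prompt-token marker are illustrative assumptions.

```python
# Hypothetical sketch: surround each scale's k x k token map with a
# one-token-wide border ("outermost token frame") of prompt tokens,
# as MVP's abstract describes. All names and sizes here are assumptions.

def wrap_with_prompt_frame(token_map, prompt_token):
    """Return a (k+2) x (k+2) grid: the original k x k token map
    with a frame of prompt tokens around it."""
    k = len(token_map)
    framed = [[prompt_token] * (k + 2)]          # top edge of the frame
    for row in token_map:
        framed.append([prompt_token] + list(row) + [prompt_token])
    framed.append([prompt_token] * (k + 2))      # bottom edge of the frame
    return framed

# Illustrative next-scale sequence (not the paper's actual scale schedule).
scales = [1, 2, 4]
maps = [[[0] * k for _ in range(k)] for k in scales]
framed_maps = [wrap_with_prompt_frame(m, prompt_token=-1) for m in maps]

# Each scale gains 4k + 4 prompt positions; interior tokens are untouched.
print([len(f) for f in framed_maps])  # prints [3, 4, 6]
```

In a real model the `-1` markers would instead be learnable embedding vectors, so the frame adds only O(k) extra tokens per scale rather than perturbing every pixel, which is consistent with the efficiency claim in the abstract.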
Supplementary Material: pdf
Primary Area: generative models
Submission Number: 769