MENTOR: Efficient Multimodal-Conditioned Tuning for Language-Guided Visual Generation Agents
Keywords: Multi-Modality
Abstract: Recent text-to-image models achieve impressive visual quality but still struggle with precise controllability, with balancing multimodal inputs, and with the high training cost of multimodal image generation.
To address these limitations, we propose \textbf{\model}, an autoregressive (AR) framework with a two-stage training paradigm for controllable multimodal image generation:
(1) a \textit{multimodal alignment stage} that establishes robust pixel and semantic-level alignment between inputs and generated tokens, followed by (2) a \textit{multimodal instruction tuning stage} that balances the model's integration of multimodal inputs and enhances generation controllability.
Extensive experiments on DreamBench++ and DreamBench demonstrate that, despite its modest model size and training resources, \model achieves a strong balance between textual and visual guidance for controllable image generation, delivering competitive performance at significantly lower computational cost than leading baselines. Moreover, our approach attains superior image reconstruction fidelity, broad adaptability across tasks, and high training efficiency.
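The two-stage paradigm in the abstract can be illustrated with a minimal pure-Python sketch of the training curriculum. All names here (`Stage`, `make_schedule`, the objective strings) are illustrative assumptions based only on the abstract's description, not the authors' actual implementation.

```python
# Hypothetical sketch of the two-stage curriculum described in the abstract.
# Class and function names are assumptions, not the MENTOR codebase.
from dataclasses import dataclass


@dataclass
class Stage:
    name: str       # e.g. "multimodal_alignment" or "instruction_tuning"
    objective: str  # what the stage optimizes, per the abstract
    steps: int      # training steps allotted to this stage


def make_schedule(align_steps: int, tune_steps: int) -> list:
    """Return the two-stage curriculum: alignment first, then tuning."""
    return [
        Stage("multimodal_alignment",
              "pixel- and semantic-level alignment of inputs to tokens",
              align_steps),
        Stage("instruction_tuning",
              "balance multimodal inputs; improve controllability",
              tune_steps),
    ]
```

The key design point the abstract emphasizes is ordering: robust alignment is established before instruction tuning, so the second stage can focus on balancing modalities rather than learning the input-to-token mapping from scratch.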
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 140