MENTOR: Efficient Autoregressive Image Generation with Balanced Multimodal Control

ACL ARR 2026 January Submission6420 Authors

05 Jan 2026 (modified: 20 Mar 2026) — ACL ARR 2026 January Submission — CC BY 4.0
Keywords: data-efficient training, cross-modal content generation, multimodality
Abstract: Recent text-to-image models achieve impressive visual quality but still face challenges in precise controllability, balancing multimodal inputs, and high training cost for multimodal image generation. To address these limitations, we propose MENTOR, an autoregressive (AR) framework with a two-stage training paradigm for controllable multimodal image generation: (1) a multimodal alignment stage that establishes robust pixel-level and semantic-level alignment between inputs and generated tokens, followed by (2) a multimodal instruction tuning stage that balances the model's integration of multimodal inputs and enhances generation controllability. Extensive experiments on DreamBench++ and DreamBench demonstrate that, despite its modest model size and training resources, MENTOR achieves a strong balance between textual and visual guidance for controllable image generation, delivering competitive performance at significantly lower computational cost than leading baselines. Moreover, our approach attains superior image reconstruction fidelity, broad adaptability across tasks, and high training efficiency.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: data-efficient training, cross-modal content generation, multimodality
Contribution Types: Approaches to low-resource settings, Approaches to low compute settings - efficiency
Languages Studied: English
Submission Number: 6420