Abstract: Unified multimodal understanding and generation has attracted considerable attention in vision-and-language research in recent years. Existing unified models (UniMs) aim to learn understanding and generation capabilities simultaneously, which demands substantial computational resources, and they suffer from two shortcomings: 1) difficulty in generating interleaved text-image content; 2) weaker understanding capabilities than multimodal large language models (MLLMs). To bridge this gap, we propose ARMOR, a resource-efficient framework designed to ``upgrade'' existing expert MLLMs rather than retrain them from scratch. Our core principle is to endow MLLMs with generation capabilities while preventing catastrophic forgetting of their top-tier understanding capabilities. We achieve this through three key innovations: (1) an asymmetric architecture that isolates a lightweight generative decoder from the frozen MLLM core via a forward-switching mechanism, enabling seamless interleaved generation; (2) a meticulously curated, high-quality interleaved dataset; and (3) a progressive three-stage ``What or How to Generate'' (WoHG) training algorithm. Experiments demonstrate that ARMOR successfully upgrades a leading MLLM, retaining over 95\% of its original understanding performance while achieving highly competitive image generation at less than 1/70 of the cost of training from scratch. This demonstrates the effectiveness of our core idea: ``the efficient paradigm of upgrading and expanding existing expert MLLMs into UniMs.''
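To make the forward-switching idea concrete, the following is a minimal, hypothetical sketch (not the authors' implementation): a frozen MLLM core whose hidden states are routed per step either to the original text head or to a lightweight trainable image decoder. All module names (`mllm_core`, `text_head`, `image_decoder`) and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ForwardSwitchingUniM(nn.Module):
    """Hypothetical sketch of an asymmetric architecture in the spirit of
    ARMOR: a frozen MLLM core routes hidden states either to the original
    text head (understanding) or to a lightweight generative decoder
    (image generation), selected per step by a forward switch."""

    def __init__(self, mllm_core: nn.Module, text_head: nn.Module,
                 image_decoder: nn.Module):
        super().__init__()
        self.core = mllm_core               # pretrained MLLM backbone
        self.text_head = text_head          # original language-model head
        self.image_decoder = image_decoder  # lightweight trainable branch
        # Freeze the core so its understanding ability is preserved
        # (guards against catastrophic forgetting while the decoder trains).
        for p in self.core.parameters():
            p.requires_grad = False

    def forward(self, inputs: torch.Tensor, emit_image: bool) -> torch.Tensor:
        hidden = self.core(inputs)  # shared representation from the frozen core
        # Forward switch: pick the output branch per generation step,
        # which allows interleaved text-image sequences.
        return self.image_decoder(hidden) if emit_image else self.text_head(hidden)

# Toy usage with stand-in linear modules (purely illustrative).
model = ForwardSwitchingUniM(nn.Linear(16, 16), nn.Linear(16, 32000), nn.Linear(16, 1024))
x = torch.randn(1, 4, 16)
text_logits = model(x, emit_image=False)   # understanding / text path
image_latents = model(x, emit_image=True)  # generative path
```

Under this sketch, only the generative branch carries trainable parameters, which is one plausible reading of how the reported training cost stays far below that of training a UniM from scratch.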
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Sungwoong_Kim2
Submission Number: 6355