Keywords: Unified Multimodal Models, Generative Models
Abstract: Unified Multimodal Models (UMMs), which integrate deep visual understanding with generative capabilities, are often constrained by inefficient training paradigms and a heavy reliance on scarce, high-quality text-image paired data. In this paper, we systematically analyze existing pre-training recipes for UMMs and identify these two issues as major bottlenecks. To address them, we propose $\textbf{Image-Only Training for UMMs (IOMM)}$, a data-efficient two-stage training framework.
The first stage pre-trains the visual generative component using abundant unlabeled image-only data, thereby removing the dependency on paired data. The second stage fine-tunes the model using a mixture of unlabeled images and a small curated set of text-image pairs, leading to improved instruction alignment and generative quality.
Extensive experiments show that IOMM not only improves training efficiency but also achieves state-of-the-art performance.
For example, our base model IOMM-B, whose generation module is trained from scratch purely on open-source data using only approximately $\textbf{1050}$ H800 GPU hours (of which $\textbf{1000}$ hours are spent on image-only pre-training), attains a score of $\textbf{0.89}$ on the GenEval benchmark, surpassing strong baselines such as BAGEL (0.88) and BLIP3-o (0.84).
Code will be released publicly.
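The abstract does not include implementation details, so the following is only a minimal, hypothetical sketch of the two-stage recipe it describes (Stage 1: pre-train the visual generative component on unlabeled images; Stage 2: fine-tune on a mixture of unlabeled images and a small curated set of text-image pairs). All module names, losses, data loaders, and hyperparameters are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of the two-stage IOMM recipe described in the abstract.
# All module names, losses, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class VisualGenerator(nn.Module):
    """Stand-in for the visual generative component of a UMM."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, image_latent, cond=None):
        # Optional text conditioning is added to the image latent when available.
        x = image_latent if cond is None else image_latent + cond
        return self.net(x)


def stage1_image_only(generator, image_loader, steps=1000, lr=1e-4):
    """Stage 1: pre-train on unlabeled images only (self-reconstruction objective here)."""
    opt = torch.optim.AdamW(generator.parameters(), lr=lr)
    for _, image_latent in zip(range(steps), image_loader):
        loss = nn.functional.mse_loss(generator(image_latent), image_latent)
        opt.zero_grad(); loss.backward(); opt.step()


def stage2_mixed_finetune(generator, image_loader, paired_loader, steps=100, lr=5e-5):
    """Stage 2: fine-tune on a mix of unlabeled images and a small set of text-image pairs."""
    opt = torch.optim.AdamW(generator.parameters(), lr=lr)
    for _, (image_latent, (pair_img, text_emb)) in zip(range(steps), zip(image_loader, paired_loader)):
        loss = (nn.functional.mse_loss(generator(image_latent), image_latent)            # image-only term
                + nn.functional.mse_loss(generator(pair_img, cond=text_emb), pair_img))  # text-conditioned term
        opt.zero_grad(); loss.backward(); opt.step()


if __name__ == "__main__":
    dim = 256
    gen = VisualGenerator(dim)
    # Dummy infinite streams standing in for real image latents / text embeddings.
    images = (torch.randn(8, dim) for _ in iter(int, 1))
    pairs = ((torch.randn(8, dim), torch.randn(8, dim)) for _ in iter(int, 1))
    stage1_image_only(gen, images, steps=10)
    stage2_mixed_finetune(gen, images, pairs, steps=5)
```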
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 12842