LightBagel: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation
Keywords: Unified Multimodal Understanding and Generation Modeling; Text-to-Image Generation; Image Editing
TL;DR: Strong unified multimodal models trained with only 35B tokens!
Abstract: Unified multimodal models have recently shown remarkable gains in both capability and versatility, yet most leading systems are still trained from scratch and require substantial computational resources. In this paper, we show that competitive performance can be obtained far more efficiently by strategically fusing publicly available models. Specifically, our key design is to retain the original VLM and DiT blocks while interleaving additional multimodal self-attention blocks throughout the network. This double fusion mechanism (1) enables rich multimodal fusion while largely preserving the original strengths of the base models, and (2) catalyzes synergistic fusion of high-level semantic representations from a ViT encoder with low-level spatial signals from a VAE encoder. By training with only $\sim$35B tokens, this approach achieves strong results across multiple benchmarks: 0.89 on GenEval for compositional text-to-image generation, 82.28 on DPG-Bench for complex text-to-image generation, and 6.06 on GEditBench and 3.65 on ImgEdit-Bench for image editing.
We will fully release the entire suite of code, model weights, and datasets to support future research on unified multimodal modeling.
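The abstract describes the architecture only at a high level. Below is a minimal PyTorch sketch of how the double fusion idea might look, assuming the original VLM/DiT blocks are kept frozen and new multimodal self-attention blocks are interleaved between them, operating over a concatenation of text tokens, ViT (semantic) tokens, and VAE (spatial) tokens. All module names, layer counts, and sizes here are illustrative placeholders, not the paper's actual implementation.

```python
# Hypothetical sketch of "double fusion": frozen base blocks (stand-ins for
# VLM / DiT layers) interleaved with trainable multimodal self-attention
# blocks that jointly attend over text, ViT, and VAE tokens.
import torch
import torch.nn as nn


class MultimodalSelfAttentionBlock(nn.Module):
    """Newly inserted block: joint self-attention over the fused token sequence."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        return x + self.mlp(self.norm2(x))


class DoubleFusionModel(nn.Module):
    """Retain (freeze) the original blocks and interleave fusion blocks between them."""
    def __init__(self, base_blocks: nn.ModuleList, dim: int):
        super().__init__()
        self.base_blocks = base_blocks
        for p in self.base_blocks.parameters():   # preserve the base models' weights
            p.requires_grad_(False)
        self.fusion_blocks = nn.ModuleList(
            MultimodalSelfAttentionBlock(dim) for _ in base_blocks
        )

    def forward(self, text_tok, vit_tok, vae_tok):
        # Second "fusion": concatenate high-level ViT semantics with
        # low-level VAE latents (plus text) along the sequence dimension.
        x = torch.cat([text_tok, vit_tok, vae_tok], dim=1)
        for base, fuse in zip(self.base_blocks, self.fusion_blocks):
            x = base(x)   # original (frozen) VLM/DiT computation
            x = fuse(x)   # interleaved multimodal self-attention
        return x


if __name__ == "__main__":
    dim, depth = 256, 4   # toy sizes, not the paper's configuration
    base = nn.ModuleList(
        nn.TransformerEncoderLayer(dim, 8, batch_first=True) for _ in range(depth)
    )
    model = DoubleFusionModel(base, dim)
    out = model(torch.randn(2, 16, dim), torch.randn(2, 64, dim), torch.randn(2, 64, dim))
    print(out.shape)  # torch.Size([2, 144, 256])
```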
Primary Area: generative models
Submission Number: 6777