Track: long paper (up to 8 pages)
Keywords: unified model, diffusion model, semantic representation
TL;DR: UniFusion replaces separate text and image encoders with a single VLM encoder, achieving competitive image generation and editing from semantic features alone, with emergent capabilities such as zero-shot multi-reference generation.
Abstract: Most generative models still rely on separate encoders for text and images (e.g., large language models and VAE latents), which complicates high-fidelity editing and limits cross-modal knowledge transfer due to heterogeneous embedding spaces. We present UniFusion, a framework of diffusion generative models conditioned solely on a frozen vision-language model (VLM) that serves as a unified multimodal encoder. UniFusion combines two ingredients. First, Layerwise Attention Pooling (LAP) aggregates representations across VLM layers to capture both high-level semantics and fine-grained details for both image and text. Second, we introduce VLM-Enabled Rewriting Injection with Flexible Inference (VeriFi), which conditions the diffusion transformer (DiT) on rewritten text tokens produced in-model by the conditioning VLM, improving distribution alignment across tasks while leveraging the VLM's reasoning.
To the best of our knowledge, UniFusion is the first architecture to perform competitive image editing using only VLM-based input conditioning, without auxiliary signals from a VAE or CLIP. With an 8B VLM and an 8B DiT, UniFusion surpasses Flux.1 [dev] and BAGEL on DPG-Bench using a smaller training set, and compares favorably to Flux.1 Kontext [dev] and Qwen-Image-Edit on editing without post-training. Moreover, the unified-encoder framework with LAP yields emergent behaviors, including zero-shot multi-reference generation despite training only on single-reference pairs, and capability transfer, where editing training improves text-to-image quality both quantitatively and qualitatively.
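To make the LAP ingredient concrete: the abstract describes aggregating per-token representations across VLM layers. A minimal sketch of one plausible parameterization is below, assuming a learned query vector and key projection that attend over the layer axis (the function and weight names are hypothetical illustrations, not the paper's exact implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layerwise_attention_pool(hidden_states, w_query, w_key):
    """Pool per-token features across VLM layers with learned attention.

    hidden_states: (L, T, D) array - L VLM layers, T tokens, D dims.
    w_query: (D,) learned query vector (hypothetical parameterization).
    w_key:   (D, D) learned key projection (hypothetical).
    Returns a (T, D) array: one fused vector per token, mixing
    high-level (late-layer) and fine-grained (early-layer) features.
    """
    L, T, D = hidden_states.shape
    keys = hidden_states @ w_key                  # (L, T, D)
    scores = (keys @ w_query) / np.sqrt(D)        # (L, T) attention logits
    weights = softmax(scores, axis=0)             # normalize over layers
    fused = (weights[..., None] * hidden_states).sum(axis=0)  # (T, D)
    return fused
```

The attention weights are normalized over the layer axis, so each token's fused representation is a convex combination of that token's states from every VLM layer.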
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 82