TL;DR: We propose a novel alignment paradigm that enables multimodal in-context reasoning capabilities in text-to-image diffusion models by integrating vision-language models.
Abstract: This paper presents ThinkDiff, a novel alignment paradigm that empowers text-to-image diffusion models with multimodal in-context understanding and reasoning capabilities by integrating the strengths of vision-language models (VLMs). Existing multimodal diffusion finetuning methods largely focus on pixel-level reconstruction rather than in-context reasoning, and are constrained by the complexity and limited availability of reasoning-based datasets. ThinkDiff addresses these challenges by leveraging vision-language training as a proxy task, aligning VLMs with the decoder of an encoder-decoder large language model (LLM) instead of a diffusion decoder. This proxy task builds on the observation that the **LLM decoder** shares the same input feature space with **diffusion decoders** that use the corresponding **LLM encoder** for prompt embedding. As a result, aligning VLMs with diffusion decoders can be simplified through alignment with the LLM decoder. Without complex training and datasets, ThinkDiff effectively unleashes understanding, reasoning, and composing capabilities in diffusion models. Experiments demonstrate that ThinkDiff significantly improves accuracy from 19.2% to 46.3% on the challenging CoBSAT benchmark for multimodal in-context reasoning generation, with only 5 hours of training on 4 A100 GPUs. Additionally, ThinkDiff demonstrates exceptional performance in composing multiple images and texts into logically coherent images. Project page: https://mizhenxing.github.io/ThinkDiff.
Lay Summary: Can we train a unified model that excels at both multimodal understanding and in-context image generation? Standard diffusion models are good at generating images that closely follow text or image prompts. However, they often fall short in understanding and reasoning about complex input contexts, especially when multiple images and texts are combined.
Our work, ThinkDiff, introduces a new approach that empowers image generation models with multimodal in-context understanding and reasoning. Instead of relying on scarce and complex reasoning datasets to train diffusion models, we align vision-language models (VLMs) with the decoder of a large language model (LLM), which shares a common feature space with diffusion decoders. This proxy task allows us to transfer reasoning ability from the VLMs to the diffusion models without direct diffusion training.
With only 5 hours of training on 4 A100 GPUs, ThinkDiff effectively unifies the understanding, reasoning, and generation capabilities in one model.
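To make the proxy-task idea concrete, below is a minimal sketch (not the authors' released code) of the alignment step: a small trainable aligner maps frozen-VLM token features into the input feature space of a frozen encoder-decoder LLM's decoder, supervised only by a standard language-modeling loss. The module names, feature widths, and the decoder's keyword interface here are illustrative assumptions, not ThinkDiff's actual implementation.

```python
# Minimal sketch of a ThinkDiff-style proxy alignment step.
# Assumptions (hypothetical): VLM_DIM / LLM_DIM widths, and that `vlm` and
# `llm_decoder` expose the interfaces used below.

import torch
import torch.nn as nn

VLM_DIM, LLM_DIM = 4096, 2048  # hypothetical feature widths


class Aligner(nn.Module):
    """Trainable projection from VLM feature space to LLM-decoder feature space."""

    def __init__(self, vlm_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vlm_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vlm_tokens: torch.Tensor) -> torch.Tensor:
        # vlm_tokens: (batch, seq, vlm_dim) -> (batch, seq, llm_dim)
        return self.proj(vlm_tokens)


def training_step(vlm, llm_decoder, aligner, images, prompts, targets):
    """One proxy-task step: only the aligner receives gradients."""
    with torch.no_grad():                  # the VLM stays frozen
        vlm_tokens = vlm(images, prompts)  # (B, S, VLM_DIM), assumed interface
    aligned = aligner(vlm_tokens)          # map into the LLM-decoder space
    # The frozen LLM decoder reconstructs the target text from the aligned
    # features; this cross-entropy is the only training signal (assumed API).
    loss = llm_decoder(encoder_hidden_states=aligned, labels=targets).loss
    return loss
```

At inference time, under this sketch, the same aligned features would be passed to a diffusion decoder that already consumes the corresponding LLM encoder's embeddings, so no diffusion finetuning is required.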
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/MiZhenxing/ThinkDiff
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: Diffusion model, Multimodal in-context reasoning, Vision-language model, Vision-language training
Submission Number: 4756