Diffusion Instruction Tuning

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Lavender: an efficient supervised fine-tuning (SFT) approach that boosts state-of-the-art vision-language models using the Stable Diffusion image generation model
Abstract: We introduce *Lavender*, a simple supervised fine-tuning (SFT) method that boosts the performance of advanced vision-language models (VLMs) by leveraging state-of-the-art image generation models such as Stable Diffusion. Specifically, Lavender aligns the text-vision attention in the VLM transformer with the equivalent used by Stable Diffusion during SFT, instead of adapting separate encoders. This alignment enriches the model’s visual understanding and significantly boosts performance across in- and out-of-distribution tasks. Lavender requires just 0.13 million training examples---2.5\% of typical large-scale SFT datasets---and fine-tunes on standard hardware (8 GPUs) in a single day. It consistently improves state-of-the-art open-source multimodal LLMs (e.g., Llama-3.2-11B, MiniCPM-Llama3-v2.5), achieving up to 30\% gains and a 68\% boost on challenging out-of-distribution medical QA tasks. By efficiently transferring the visual expertise of image generators with minimal supervision, Lavender offers a scalable solution for more accurate vision-language systems. Code, training data, and models are available on the [project page](https://astrazeneca.github.io/vlm/).
Lay Summary: AI models that can understand both images and text are powerful, but they are often held back by a lack of training data. We wondered if we could improve these vision-language models (VLMs) by borrowing knowledge from a different kind of AI: one that is an expert at *generating* images, like Stable Diffusion. We noticed that image generators have a very precise internal map of how words correspond to specific image regions. Our method, which we call *Lavender*, uses these highly accurate "attention maps" as a guide to fine-tune the VLM. Essentially, we teach the VLM to "look" at the image with the same focus as the expert image generator, enriching its visual understanding. This alignment provides a powerful and efficient training signal. The results were significant: models improved by up to 30% on general benchmarks and even 68% on a challenging medical question-answering task. Remarkably, this was achieved with just 2.5% of the data typically used for this kind of training and in only a single day. Lavender offers a scalable way to build more robust and capable vision-language systems by bridging two expert AI paradigms.
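The abstract describes the core mechanism as an attention-alignment objective added to standard SFT: the VLM's text-vision attention is pulled toward the cross-attention maps that Stable Diffusion produces for the same image-text pair. The sketch below is a minimal illustration of that idea, not the released implementation: the per-token attention shapes, the use of MSE as the alignment loss, the Hugging Face-style model interface, the `extract_text_vision_attention` helper, and the `lambda_align` weight are all illustrative assumptions.

```python
# Minimal sketch of the attention-alignment idea (illustrative assumptions,
# not the authors' released code).
import torch
import torch.nn.functional as F


def attention_alignment_loss(vlm_attn: torch.Tensor,
                             sd_attn: torch.Tensor) -> torch.Tensor:
    """Match the VLM's text-vision attention to Stable Diffusion's cross-attention.

    vlm_attn: (batch, text_tokens, h, w) text-to-image-patch attention from the VLM.
    sd_attn:  (batch, text_tokens, H, W) cross-attention maps extracted from
              Stable Diffusion for the same image-text pair.
    """
    # Resize the diffusion maps to the VLM's spatial resolution.
    sd_attn = F.interpolate(sd_attn, size=vlm_attn.shape[-2:],
                            mode="bilinear", align_corners=False)
    # Normalise each map so only the spatial distribution of attention matters.
    vlm_attn = vlm_attn / (vlm_attn.sum(dim=(-2, -1), keepdim=True) + 1e-8)
    sd_attn = sd_attn / (sd_attn.sum(dim=(-2, -1), keepdim=True) + 1e-8)
    # MSE is an assumed choice of alignment loss for this sketch.
    return F.mse_loss(vlm_attn, sd_attn)


def sft_step(vlm, batch, sd_attn_maps, lambda_align: float = 1.0):
    """One SFT step: next-token loss plus the attention-alignment term."""
    outputs = vlm(**batch, output_attentions=True)      # assumed HF-style interface
    lm_loss = outputs.loss                              # standard next-token loss
    vlm_attn = extract_text_vision_attention(outputs)   # hypothetical helper
    align_loss = attention_alignment_loss(vlm_attn, sd_attn_maps)
    return lm_loss + lambda_align * align_loss
```

In this sketch the diffusion attention maps are precomputed offline for each training pair, so fine-tuning adds only a lightweight auxiliary loss on top of the usual SFT objective.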
Link To Code: https://astrazeneca.github.io/vlm/
Primary Area: Deep Learning->Foundation Models
Keywords: vision-language models, diffusion models, image-to-text generation, text-to-image alignment, cross-attention, visual instruction tuning, attention alignment, out-of-distribution robustness, data-efficient training, unified multimodal systems, transformer fine-tuning, vision-language generalization
Submission Number: 2134