DS-VLM: Diffusion Supervision Vision Language Model

Zhen Sun; Yunhang Shen; Jie Li; Xing Sun; Pingyang Dai; Liujuan Cao; Rongrong Ji

Published: 01 May 2025, Last Modified: 28 Jul 2025. Venue: ICML 2025 poster. License: CC BY 4.0

TL;DR: We propose a diffusion-supervised vision-language model (DS-VLM) that uses the input image directly to supervise the visual encoder and connector via a diffusion model.

Abstract: Vision-Language Models (VLMs) face two critical limitations in visual representation learning: degraded supervision due to information loss during gradient propagation, and the inherent semantic sparsity of textual supervision compared to visual data. We propose the Diffusion Supervision Vision-Language Model (DS-VLM), a plug-and-play framework that introduces diffusion-based direct supervision for vision-language alignment. By reconstructing input images through a diffusion model conditioned on outputs of the visual encoder and the connector, our method establishes a short-path gradient propagation channel from pixel space to visual features. This approach simultaneously preserves high-level semantic alignment through conventional text supervision while enhancing visual feature quality via pixel-level reconstruction constraints. Extensive experiments conducted across various visual encoders and LLMs of different scales demonstrate the effectiveness of our approach.
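The training signal described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact objective, so everything here is an assumption: a DDPM-style noise-prediction loss stands in for the reconstruction constraint, a weight `lam` combines it with the text loss, and `toy_denoiser` is a hypothetical stand-in for a diffusion model conditioned on visual features. The point is only that the reconstruction loss shrinks when the conditioning carries more information about the input image.

```python
import math

def add_noise(x0, eps, alpha_bar):
    """DDPM-style forward process: x_t = sqrt(alpha_bar)*x0 + sqrt(1-alpha_bar)*eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]

def toy_denoiser(x_t, cond, alpha_bar):
    """Hypothetical conditional denoiser (not the paper's model).

    It treats the conditioning vector as its estimate of the clean image
    and solves the forward equation for the noise.
    """
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(xt - a * c) / b for xt, c in zip(x_t, cond)]

def diffusion_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)

def total_loss(text_loss, diff_loss, lam=1.0):
    """Assumed combination: text supervision plus weighted diffusion supervision."""
    return text_loss + lam * diff_loss

# Toy "image" and noise; real inputs would be image tensors or latents.
x0 = [1.0, -2.0, 0.5, 3.0]
eps = [0.3, -0.1, 0.7, -0.4]
alpha_bar = 0.5
x_t = add_noise(x0, eps, alpha_bar)

# Conditioning that fully describes the image versus conditioning that carries nothing.
loss_informative = diffusion_loss(toy_denoiser(x_t, x0, alpha_bar), eps)
loss_uninformative = diffusion_loss(toy_denoiser(x_t, [0.0] * 4, alpha_bar), eps)
```

In this toy setting, `loss_informative` is essentially zero while `loss_uninformative` is not, so minimizing the diffusion term pushes the conditioning (in DS-VLM, the outputs of the visual encoder and connector) to retain pixel-level detail. Because the diffusion branch appears only in the loss, it can be dropped at inference time.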

Lay Summary: Today’s AI systems that connect pictures and words often miss fine image details because they learn mainly from text feedback that must travel through a very large language model. We created a simple training add-on that asks the computer to “redraw” each picture, pixel by pixel, with a generative “diffusion” engine. This direct exercise gives the vision part of the model much clearer guidance—like letting an art student repaint a scene instead of only hearing comments—while the usual text guidance still keeps words and images in sync. After this training, the same models answer image-based questions more accurately on ten public tests, yet they run just as fast as before because the extra diffusion step is needed only during training. The result is a practical path toward sharper, more reliable visual understanding in everyday AI tools.

Primary Area: Deep Learning->Large Language Models

Keywords: Vision language model;Diffusion Model

Submission Number: 946
