Vision Language Model Distillation Using Partial Information Decomposition

Published: 11 Jun 2025, Last Modified: 10 Jul 2025, ES-FoMo III, CC BY 4.0
Keywords: Vision Language Model, Model Distillation, Partial Information Decomposition
TL;DR: Distilling vision-language models with a PID-based synergy term that preserves cross-modal interactions in the student.
Abstract: Vision-Language Models (VLMs) have achieved remarkable success by integrating visual and textual modalities, enabling advances in tasks such as image captioning and multimodal retrieval. However, their substantial computational cost and large model sizes hinder deployment in resource-constrained environments. This paper introduces a novel approach to VLM distillation that incorporates synergistic information, capturing emergent properties of the interaction between visual and textual modalities, into the distillation framework. Leveraging Partial Information Decomposition (PID), we decompose mutual information into unique, redundant, and synergistic components and explicitly optimize the student model to retain critical multimodal interactions. The proposed framework combines a contrastive loss, KL divergence, L2 regularization, and a synergy term in the total loss function. Experimental results demonstrate that incorporating synergistic information significantly improves retrieval performance on both image-to-text and text-to-image tasks compared to traditional distillation approaches. Although the student model (ResNet-34 with a 2-layer transformer) lags behind the teacher model (CLIP ViT-B/16) due to its smaller capacity and lack of pretraining, the proposed method consistently narrows the performance gap. This work highlights the importance of synergistic information in VLM distillation and lays a foundation for future work on scaling student models, pretraining strategies, and optimizing synergy-driven objectives. Our findings underscore the transformative potential of synergy in developing lightweight, efficient VLMs without compromising multimodal understanding and performance.
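The abstract invokes PID without stating it formally. As a reference point (standard Williams and Beer PID, not taken from this paper), the mutual information between a target T and two sources, here visual features X_v and textual features X_t, decomposes as:

```latex
I(T; X_v, X_t) =
    \underbrace{U(T; X_v)}_{\text{unique (vision)}}
  + \underbrace{U(T; X_t)}_{\text{unique (text)}}
  + \underbrace{R(T; X_v, X_t)}_{\text{redundant}}
  + \underbrace{S(T; X_v, X_t)}_{\text{synergistic}}
```

The synergistic component S is the quantity the distillation objective asks the student to preserve.

Below is a minimal sketch, assuming a PyTorch setting, of how the four terms named in the abstract (contrastive loss, KL divergence, L2 regularization, synergy term) might be combined. The weights alpha, beta, gamma and the externally supplied synergy estimate are assumptions; the paper's actual objective and PID estimator are not given on this page.

```python
import torch
import torch.nn.functional as F

def total_distillation_loss(student_img, student_txt,
                            teacher_img, teacher_txt,
                            student_params, synergy_term,
                            alpha=1.0, beta=1e-4, gamma=0.1,
                            temperature=0.07):
    """Hypothetical composition of the four loss terms named in the abstract.

    `synergy_term` is assumed to be a scalar tensor produced by a PID-based
    synergy estimator defined in the paper body, not reproduced here.
    """
    # CLIP-style symmetric contrastive loss on the student's embeddings,
    # with matched image-text pairs on the diagonal.
    s_img = F.normalize(student_img, dim=-1)
    s_txt = F.normalize(student_txt, dim=-1)
    logits = s_img @ s_txt.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    l_con = (F.cross_entropy(logits, targets)
             + F.cross_entropy(logits.t(), targets)) / 2

    # KL divergence aligning the student's similarity distribution
    # with the teacher's (CLIP ViT-B/16) distribution.
    t_img = F.normalize(teacher_img, dim=-1)
    t_txt = F.normalize(teacher_txt, dim=-1)
    teacher_logits = t_img @ t_txt.t() / temperature
    l_kl = F.kl_div(F.log_softmax(logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")

    # L2 regularization over the student's parameters.
    l_l2 = sum(p.pow(2).sum() for p in student_params)

    # Weighted sum; the weights are illustrative, not from the paper.
    return l_con + alpha * l_kl + beta * l_l2 + gamma * synergy_term
```

In this reading, the gamma-weighted synergy term is what distinguishes the method from standard contrastive/KL distillation, which optimizes only the first three terms.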
Submission Number: 65