MLLMCLIP: Feature-Level Distillation of MLLM for Robust Vision-Language Representations

18 Sept 2025 (modified: 24 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: MLLM, CLIP, Distillation, Representation Learning
Abstract: Large-scale pretrained vision–language models, such as CLIP, have become the backbone of modern zero-shot recognition. Despite their strong generalization ability, these models often struggle with compositionality, particularly in understanding attribute-object combinations and relational structures. Recent studies mitigate this issue by augmenting training with synthetic hard negatives generated by large language models and text-to-image models. Yet this strategy relies on separate expert models, introducing a sequential generation pipeline with quality-control overhead and resulting in a disjointed source of multimodal understanding. To overcome these limitations, we propose MLLMCLIP, a feature-level distillation framework that bypasses synthetic data generation by directly transferring multimodal knowledge from a Multimodal Large Language Model (MLLM). Our framework addresses the key challenges of cross-architecture distillation with three core contributions: (1) a question-answering-based protocol to select the teacher MLLM, (2) an attention-based method to identify salient teacher tokens, and (3) an adaptation of Centered Kernel Alignment (CKA) for stable knowledge transfer. MLLMCLIP achieves state-of-the-art performance on 9 out of 11 compositionality benchmarks, while also yielding significant improvements in general-purpose tasks such as zero-shot classification and image-text retrieval.
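For illustration, below is a minimal sketch of how a CKA-based distillation objective between student (CLIP) and teacher (MLLM) features can look. The abstract does not specify the paper's exact adaptation (token selection, kernel choice, or loss weighting), so the linear-CKA form and the function names here are assumptions for exposition only.

```python
import torch


def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Linear Centered Kernel Alignment between two feature matrices.

    x: (n, d_student) student features, y: (n, d_teacher) teacher features.
    Dimensions d_student and d_teacher may differ; only n must match.
    """
    # Center each feature dimension across the batch.
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    # Linear CKA: ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F).
    cross = (x.t() @ y).norm(p="fro") ** 2
    norm_x = (x.t() @ x).norm(p="fro")
    norm_y = (y.t() @ y).norm(p="fro")
    return cross / (norm_x * norm_y)


def cka_distillation_loss(student_feats: torch.Tensor,
                          teacher_feats: torch.Tensor) -> torch.Tensor:
    # Maximizing alignment is equivalent to minimizing (1 - CKA);
    # this would be added to the usual contrastive training loss.
    return 1.0 - linear_cka(student_feats, teacher_feats)
```

Because CKA is invariant to invertible linear transformations and isotropic scaling of either feature space, such a loss can compare representations of different width without a learned projection head, which is one plausible reason it suits cross-architecture (MLLM-to-CLIP) transfer.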
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 11473