Keywords: MLLM, CLIP, Distillation, Representation Learning
Abstract: Large-scale pretrained vision–language models, such as CLIP, have become the backbone of modern zero-shot recognition.
Despite their strong generalization ability, these models often struggle with compositionality, particularly in understanding attribute-object combinations and relational structures.
Recent studies mitigate this issue by augmenting training with synthetic hard negatives generated by large language models and text-to-image models.
Yet, this strategy relies on separate expert models, introducing a sequential generation pipeline with quality-control overhead and fragmenting multimodal understanding across disjoint sources.
To overcome these limitations, we propose MLLMCLIP, a feature-level distillation framework that bypasses synthetic data generation by directly transferring multimodal knowledge from a Multimodal Large Language Model (MLLM).
Our framework addresses the key challenges of cross-architecture distillation with three core contributions:
(1) a question-answering-based protocol to select the teacher MLLM,
(2) an attention-based method to identify salient teacher tokens, and
(3) the successful adaptation of Centered Kernel Alignment (CKA) for stable knowledge transfer (see the sketch below).
MLLMCLIP achieves state-of-the-art performance on 9 out of 11 compositionality benchmarks, while also yielding significant improvements in general-purpose tasks, such as zero-shot classification and image-text retrieval.
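Below is a minimal, illustrative sketch (not the paper's released code) of how contributions (2) and (3) might be realized: attention mass is used to keep the most-attended teacher tokens, and a linear-CKA objective (Kornblith et al., 2019) aligns student and teacher features. All function names, tensor shapes, and the top-k selection rule are assumptions made for illustration.

```python
# Hypothetical sketch of attention-based token selection and a CKA-style
# distillation loss; names and shapes are illustrative, not the authors' API.
import torch


def select_salient_tokens(teacher_tokens, attn_weights, k=32):
    """Keep the k teacher tokens that receive the most attention.

    teacher_tokens: (N, D_t) token features from the teacher MLLM.
    attn_weights:   (N,) attention mass received by each token,
                    e.g. averaged over heads and query positions (assumed).
    """
    topk = torch.topk(attn_weights, k=min(k, attn_weights.numel())).indices
    return teacher_tokens[topk]                      # (k, D_t)


def linear_cka(X, Y):
    """Linear CKA between two feature sets with matched rows.

    X: (n, d1), Y: (n, d2). Returns a scalar in [0, 1]; higher means the two
    representations are more similar (Kornblith et al., 2019).
    """
    X = X - X.mean(dim=0, keepdim=True)              # center each feature dim
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (Y.T @ X).pow(2).sum()                    # ||Y^T X||_F^2
    norm_x = torch.linalg.norm(X.T @ X)              # ||X^T X||_F
    norm_y = torch.linalg.norm(Y.T @ Y)              # ||Y^T Y||_F
    return hsic / (norm_x * norm_y + 1e-8)


def cka_distillation_loss(student_feats, teacher_feats):
    """1 - CKA, so minimizing the loss pulls the representations together."""
    return 1.0 - linear_cka(student_feats, teacher_feats)


# Toy usage with random tensors standing in for real features.
if __name__ == "__main__":
    teacher_tokens = torch.randn(256, 4096)          # MLLM token features
    attn_weights = torch.rand(256)                   # received attention mass
    salient = select_salient_tokens(teacher_tokens, attn_weights, k=64)

    student_feats = torch.randn(64, 768)             # CLIP-side features for
                                                     # the same 64 positions
    loss = cka_distillation_loss(student_feats, salient)
    print(f"CKA distillation loss: {loss.item():.4f}")
```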
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 11473