Retaining Knowledge and Enhancing Long-Text Representations in CLIP through Dual-Teacher Distillation

Published: 01 Jan 2025 · Last Modified: 24 Jul 2025 · CVPR 2025 · CC BY-SA 4.0
Abstract: Contrastive language-image pretraining models such as CLIP have demonstrated remarkable performance in various text-image alignment tasks. However, CLIP's inherent 77-token input limit and its reliance on predominantly short-text training data restrict its ability to handle long-text tasks effectively. To overcome these constraints, we propose LongD-CLIP, a dual-teacher distillation framework designed to enhance long-text representation while mitigating knowledge forgetting. In our approach, a teacher model fine-tuned on long-text data distills rich representation knowledge into a student model, while the original CLIP serves as a secondary teacher that helps the student retain its foundational knowledge. Extensive experiments show that LongD-CLIP significantly outperforms existing models across long-text retrieval, short-text retrieval, and zero-shot image classification tasks. For instance, in the image-to-text retrieval task on the ShareGPT4V test set, LongD-CLIP exceeds Long-CLIP's performance by 2.5%, achieving an accuracy of 98.3%. Similarly, on the Urban-1k dataset, it records a 9.2% improvement, reaching 91.9%, underscoring its robust generalization capabilities. Additionally, the text encoder of LongD-CLIP exhibits reduced latent-space drift and improved compatibility with existing generative models, effectively overcoming the 77-token input constraint.
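To make the dual-teacher idea concrete, the sketch below shows one plausible form of the training objective: the student text encoder is pulled toward a long-text fine-tuned teacher while a second term keeps it close to the frozen original CLIP encoder. This is a minimal illustration assuming a PyTorch-style setup; the names (`long_teacher_emb`, `clip_teacher_emb`, `lambda_retain`) and the cosine-distillation form are assumptions for exposition, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F


def dual_teacher_distill_loss(student_emb: torch.Tensor,
                              long_teacher_emb: torch.Tensor,
                              clip_teacher_emb: torch.Tensor,
                              lambda_retain: float = 0.5) -> torch.Tensor:
    """Combine two distillation terms over L2-normalized text embeddings:
    (1) match the long-text fine-tuned teacher to gain long-text capacity,
    (2) stay close to the frozen original CLIP encoder to limit forgetting.
    """
    s = F.normalize(student_emb, dim=-1)
    t_long = F.normalize(long_teacher_emb, dim=-1)
    t_clip = F.normalize(clip_teacher_emb, dim=-1)

    # Cosine-similarity distillation: 1 - cos(student, teacher), batch-averaged.
    loss_long = (1.0 - (s * t_long).sum(dim=-1)).mean()
    loss_retain = (1.0 - (s * t_clip).sum(dim=-1)).mean()

    # lambda_retain (assumed hyperparameter) trades off new long-text
    # knowledge against retention of the original CLIP representation.
    return loss_long + lambda_retain * loss_retain
```

In practice, both teacher embeddings would be computed under `torch.no_grad()` with the teachers frozen, so gradients flow only through the student encoder.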