Unsupervised Knowledge Distillation via Local Representations for Vision-Language Models

Published: 2025 · Last Modified: 05 Nov 2025 · IEEE Signal Process. Lett. 2025 · CC BY-SA 4.0
Abstract: Recent vision-language models (VLMs) have adopted prompt learning for downstream adaptation, yet they often suffer from poor generalization and require labeled data. Unsupervised knowledge distillation (UKD) offers a promising alternative, but existing methods primarily distill global predictions, neglecting the potential of local representations for fine-grained recognition. Interestingly, we observe that large VLMs such as CLIP-L/14 yield strong global features but produce weak and noisy local tokens. To address this, we propose a two-stage UKD framework that introduces an assistant model—comparable in size to the student—to refine and transfer reliable local cues under teacher supervision. Extensive experiments on 11 benchmarks demonstrate that our method consistently outperforms global-only distillation approaches in domain generalization and unseen class recognition tasks.
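The full method is not reproduced on this page, but a minimal sketch of the kind of objective the abstract describes might look as follows: a standard global prediction-distillation term from the teacher combined with a local token-alignment term against the assistant's refined local features. All names here (`global_kd_loss`, `local_alignment_loss`, the `lambda_local` weight, and the assistant interface) are hypothetical illustrations under assumed PyTorch conventions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def global_kd_loss(student_logits, teacher_logits, tau=2.0):
    """Global distillation: KL divergence between softened class distributions."""
    p_teacher = F.softmax(teacher_logits / tau, dim=-1)
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau ** 2


def local_alignment_loss(student_tokens, assistant_tokens):
    """Local distillation: cosine alignment of patch tokens (shape B x N x D).

    The assistant tokens are assumed to have been refined under teacher
    supervision in a first stage; this interface is a hypothetical stand-in.
    """
    s = F.normalize(student_tokens, dim=-1)
    a = F.normalize(assistant_tokens, dim=-1)
    return (1.0 - (s * a).sum(dim=-1)).mean()


def ukd_loss(student_logits, teacher_logits,
             student_tokens, assistant_tokens, lambda_local=0.5):
    """Stage-two objective: teacher supplies global predictions, the assistant
    supplies refined local cues; lambda_local balances the two terms."""
    return (global_kd_loss(student_logits, teacher_logits)
            + lambda_local * local_alignment_loss(student_tokens, assistant_tokens))
```

In this sketch the teacher only ever contributes its reliable global predictions, while the locally noisy tokens of the large VLM are bypassed in favor of the assistant's refined ones, which is the division of labor the abstract motivates.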