Recycling Pretrained Classification Heads for Efficient Vision-Language Alignment

Submitted 17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Vision-Language Alignment, Vision and Language Encoders, Zero-Shot Transfer, Image-to-Text Retrieval
TL;DR: A data-efficient vision-language alignment method that recycles typically discarded ImageNet-21K classification head weights as semantic prototypes, combining them with minimal image-text pairs.
Abstract: Vision-Language Models (VLMs) with separate image and text encoders, such as CLIP, excel at tasks such as zero-shot classification and cross-modal retrieval. They achieve this by embedding images and text into a shared representation space. However, their success relies on end-to-end training with large volumes of paired samples, entailing prohibitive data and computational costs. Existing post-hoc vision-language alignment methods, which map independently trained image and text encoders into a shared representation space using lightweight functions, reduce training costs but still require substantial paired data. We introduce a data augmentation approach that recycles classification head weights from ImageNet-21K pretraining and combines them with a reduced number of image-text pairs to achieve vision-language alignment. These recycled weights significantly mitigate the need for large alignment datasets, while the combination with a reduced number of image-text pairs extends alignment beyond the original ImageNet domain. We demonstrate that integrating our augmentation approach with several state-of-the-art post-hoc alignment techniques consistently boosts accuracy on cross-modal retrieval and on zero- and few-shot classification tasks. Experiments confirm that our approach provides a versatile and data-efficient solution for vision-language representation alignment.
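To make the core idea concrete, here is a minimal sketch of one plausible reading of the abstract: rows of a pretrained classification head live in the image encoder's feature space, so each row can serve as a "semantic prototype" paired with a text embedding of its class name, and a lightweight linear map can be trained contrastively on these pairs in place of (or alongside) real image-text pairs. All tensors, dimensions, and names below are illustrative stand-ins, not the paper's actual method or data.

```python
# Hypothetical sketch: recycle classification-head rows as image-side
# prototypes and align text embeddings to them with a lightweight map.
# Random tensors stand in for the real head weights and text embeddings.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_classes, d_img, d_txt = 1000, 512, 384

head_W = torch.randn(n_classes, d_img)     # recycled classifier head rows
class_txt = torch.randn(n_classes, d_txt)  # text embeddings of class names

proj = torch.nn.Linear(d_txt, d_img, bias=False)  # lightweight alignment map
opt = torch.optim.Adam(proj.parameters(), lr=1e-3)


def full_loss() -> float:
    """Contrastive (InfoNCE-style) loss over all class/prototype pairs."""
    with torch.no_grad():
        t = F.normalize(proj(class_txt), dim=-1)
        v = F.normalize(head_W, dim=-1)
        logits = t @ v.T / 0.07
        return F.cross_entropy(logits, torch.arange(n_classes)).item()


init_loss = full_loss()
for step in range(100):
    idx = torch.randint(0, n_classes, (256,))
    t = F.normalize(proj(class_txt[idx]), dim=-1)
    v = F.normalize(head_W[idx], dim=-1)
    logits = t @ v.T / 0.07  # in-batch negatives, matched pairs on diagonal
    loss = F.cross_entropy(logits, torch.arange(len(idx)))
    opt.zero_grad()
    loss.backward()
    opt.step()
final_loss = full_loss()
```

In the actual method, a small set of real image-text pairs would be mixed into the training batches to extend alignment beyond the ImageNet label space; here the sketch only shows the prototype-recycling half.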
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 9545