DKP: Semantic Consistency Distillation via Key-layer Pre-alignment with Large Vision-language Models for Cross-Modal Retrieval
Keywords: Semantic Consistency Distillation, Key-layer Pre-alignment, Large Vision-language Models, Cross-Modal Retrieval
TL;DR: Semantic Consistency Distillation via Key-layer Pre-alignment with Large Vision-language Models for Cross-Modal Retrieval
Abstract: Recent retrieval solutions based on large vision-language models (VLMs) have shown promising performance by aligning vision and language representations for image-text retrieval (ITR). However, most methods rely solely on final-layer features, overlooking the rich semantic patterns embedded in intermediate layers. In this paper, we propose Semantic Consistency \textbf{\underline{D}}istillation via \textbf{\underline{K}}ey-layer \textbf{\underline{P}}re-alignment (termed DKP), a novel paradigm that enhances cross-modal retrieval by leveraging the intermediate knowledge of VLMs. Specifically, we introduce (i) Key-layer Pre-alignment (KPA) to identify and align the most semantically meaningful intermediate features across modalities, and (ii) Semantic Consistency Distillation (SCD) to regularize cross-modal learning via intra-modal structure. Extensive experiments on Flickr30K and MS-COCO validate that DKP significantly boosts retrieval performance while using over 60\% fewer learnable parameters and substantially less computation, without introducing additional supervision or external knowledge. The anonymous code is available at: \textcolor{blue}{\url{https://anonymous.4open.science/r/DKP}}.
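The abstract only names the two components; a minimal NumPy sketch of one plausible reading is shown below — KPA as picking the intermediate layer whose matched image/text features have the highest mean cosine similarity, and SCD as a regularizer that penalizes disagreement between the intra-modal similarity structures of the two modalities. All function names and the exact scoring/loss choices here are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def cosine_sim(a, b):
    """Row-wise cosine similarity matrix between two feature sets."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def select_key_layer(img_layers, txt_layers):
    """Hypothetical KPA: score each intermediate layer by the mean cosine
    similarity of matched image-text pairs and return the best layer index."""
    scores = [float(np.mean(np.diag(cosine_sim(v, t))))
              for v, t in zip(img_layers, txt_layers)]
    return int(np.argmax(scores)), scores

def semantic_consistency_loss(img_feats, txt_feats):
    """Hypothetical SCD: mean squared gap between the image-image and
    text-text similarity matrices (intra-modal structure agreement)."""
    s_img = cosine_sim(img_feats, img_feats)
    s_txt = cosine_sim(txt_feats, txt_feats)
    return float(np.mean((s_img - s_txt) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy setup: 3 transformer layers, 8 matched image-text pairs, dim 16.
    img_layers = [rng.normal(size=(8, 16)) for _ in range(3)]
    txt_layers = [rng.normal(size=(8, 16)) for _ in range(3)]
    key, scores = select_key_layer(img_layers, txt_layers)
    print("key layer:", key)
    print("SCD loss at key layer:",
          semantic_consistency_loss(img_layers[key], txt_layers[key]))
```

Identical features in both modalities would drive the sketched SCD loss to zero, which matches the intuition that the regularizer rewards consistent intra-modal neighborhood structure.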
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7518