LS-CLIP: Autoencoder-Based Mining of CLIP's Inherent Local Semantics in Cross-Domain Image Retrieval

ICLR 2026 Conference Submission17362 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Cross-Domain Image Retrieval; CLIP; Autoencoder; Local Semantics; Pluggable
TL;DR: LS-CLIP improves CLIP's performance on specialized tasks by using an autoencoder approach and feature moment transfer to mine local semantic information and enhance the model's generalization ability.
Abstract: Contrastive Language-Image Pretraining (CLIP) excels in cross-domain image retrieval. However, existing methods often depend on extensive manual annotations for local supervision and neglect CLIP's native local-semantic capabilities. To address these problems, we propose an autoencoder-based approach named LS-CLIP, designed to mine the local semantics in CLIP and achieve cross-domain feature alignment. First, we design a self-supervised Semantic Reconstruction Module (SRM) for local feature mining. By reconstructing the patch features of the Vision Transformer (ViT), SRM integrates global and local semantic perception, enabling it to adapt to retrieval tasks of different granularities. Second, we introduce Feature Moment Transfer (FMT), which reconstructs cross-domain features via moment transfer, enhancing the stability of the feature space. This module also injects noise during distribution reconstruction, thereby improving the model's generalization ability. To accommodate diverse retrieval intents, we construct a dataset with rich textual descriptions and a wide range of scenarios, named CDIR-Flickr30k. Extensive experiments demonstrate that LS-CLIP significantly outperforms state-of-the-art baselines across various metrics, and zero-shot evaluation confirms its strong generalization capability. Importantly, LS-CLIP can be applied as a plug-and-play module to CLIP variants, consistently delivering performance improvements.
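The abstract does not give the exact formulation of Feature Moment Transfer, but moment transfer between feature distributions is commonly realized by matching the first two moments (mean and standard deviation), AdaIN-style. Below is a minimal, hypothetical NumPy sketch of that idea, including the noise perturbation of the target statistics mentioned above; the function name and arguments are assumptions, not the authors' implementation.

```python
import numpy as np

def moment_transfer(source, target, eps=1e-6, noise_std=0.0, rng=None):
    """Shift `source` features so their mean/std match those of `target`.

    source: (N, D) features from one domain
    target: (M, D) features from another domain
    noise_std: if > 0, Gaussian noise is added to the target moments,
               a simple stand-in for the noise-based reconstruction
               described in the abstract (assumed, not the paper's exact scheme).
    """
    rng = rng or np.random.default_rng()
    mu_s, sigma_s = source.mean(axis=0), source.std(axis=0)
    mu_t, sigma_t = target.mean(axis=0), target.std(axis=0)
    if noise_std > 0:
        # Perturb the target statistics to diversify the reconstructed distribution.
        mu_t = mu_t + rng.normal(0.0, noise_std, mu_t.shape)
        sigma_t = sigma_t + rng.normal(0.0, noise_std, sigma_t.shape)
    # Normalize source features, then re-scale to the (possibly noisy) target moments.
    normalized = (source - mu_s) / (sigma_s + eps)
    return normalized * sigma_t + mu_t
```

With `noise_std=0`, the transferred features reproduce the target domain's per-dimension mean and standard deviation exactly (up to `eps`), which is the stabilizing alignment effect the abstract attributes to FMT.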
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 17362