Keywords: CLIP, Segmentation, Vision-Language Models, Contrastive Learning, Self-Improving
TL;DR: A self-improving vision-language model that enhances fine-grained understanding through self-curated objectives
Abstract: In this paper, we introduce DetailCLIP, a self-improving vision-language foundation model designed to enhance fine-grained feature understanding through self-supervised learning. Foundation models like CLIP have demonstrated strong performance in global image-text alignment but often fail to capture the detail-oriented features necessary for tasks such as segmentation. To address this, DetailCLIP integrates self-curated learning objectives that iteratively improve both high-level semantics and detailed visual representations. Specifically, our method employs patch-level self-distillation and pixel-level reconstruction losses to generate refined internal representations, while an attention-based token filtering mechanism curates semantically relevant information during training. By generating and refining self-curated learning signals, DetailCLIP improves segmentation performance and demonstrates superior generalization across diverse tasks. These task-agnostic objectives position DetailCLIP as a self-improving foundation model, enhancing multimodal systems like CLIP with fine-grained feature understanding.
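The abstract combines patch-level self-distillation, pixel-level reconstruction, and attention-based token filtering into one training objective. Below is a minimal PyTorch sketch of how such a combination could be wired together, assuming a teacher/student ViT setup; all names here (detail_losses, attn_weights, keep_ratio, etc.) are hypothetical illustrations, not the authors' actual implementation.

```python
# Hypothetical sketch of a combined detail-oriented objective: patch-level
# self-distillation plus pixel-level reconstruction, restricted to patches
# selected by an attention-based token filter. Not the paper's code.
import torch
import torch.nn.functional as F


def detail_losses(student_patches, teacher_patches, recon_pixels, target_pixels,
                  attn_weights, keep_ratio=0.5, temperature=0.1):
    """Return (distillation_loss, reconstruction_loss) over filtered tokens."""
    # Attention-based token filtering: keep the top-k patches by attention mass.
    num_keep = max(1, int(keep_ratio * attn_weights.shape[1]))
    keep_idx = attn_weights.topk(num_keep, dim=1).indices            # (B, k)

    idx = keep_idx.unsqueeze(-1).expand(-1, -1, student_patches.shape[-1])
    s = torch.gather(student_patches, 1, idx)                        # (B, k, D)
    t = torch.gather(teacher_patches, 1, idx)                        # (B, k, D)

    # Patch-level self-distillation: student matches softened teacher tokens.
    distill = F.kl_div(
        F.log_softmax(s / temperature, dim=-1),
        F.softmax(t.detach() / temperature, dim=-1),
        reduction="batchmean",
    )

    # Pixel-level reconstruction on the same filtered patches.
    idx_px = keep_idx.unsqueeze(-1).expand(-1, -1, target_pixels.shape[-1])
    recon = F.mse_loss(torch.gather(recon_pixels, 1, idx_px),
                       torch.gather(target_pixels, 1, idx_px))

    return distill, recon


if __name__ == "__main__":
    B, N, D, P = 2, 196, 512, 768       # batch, patches, embed dim, pixels per patch
    student = torch.randn(B, N, D)
    teacher = torch.randn(B, N, D)
    recon = torch.randn(B, N, P)
    target = torch.randn(B, N, P)
    attn = torch.rand(B, N)             # e.g. CLS-token attention per patch
    d, r = detail_losses(student, teacher, recon, target, attn)
    print(f"distillation: {d.item():.4f}  reconstruction: {r.item():.4f}")
```

The filtering step here simply keeps the top-k patches by attention weight, which is one plausible reading of "attention-based token filtering"; the paper's actual selection rule and loss weighting may differ.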
Submission Number: 66