microCLIP: Unsupervised CLIP Adaptation via Coarse-Fine Token Fusion for Fine-Grained Image Classification
Keywords: Vision-language Models, Unsupervised Adaptation, Fine-grained Visual Recognition
Abstract: Unsupervised adaptation (UA) of CLIP-based vision-language models (VLMs) for fine-grained image classification is challenging because pseudo-labels must be inferred from tiny local cues. CLIP’s contrastive pretraining yields strong zero-shot transfer, but its coarse-grained $\texttt{[CLS]}$ visual token often misses fine spatial details. Existing UA methods mainly enrich text prompts with large language model (LLM) descriptions while still relying on $\texttt{[CLS]}$, which limits spatial precision. We propose $\textbf{microCLIP}$, a label-free self-training framework that jointly adapts visual features and LLM-derived text prototypes using fine-grained cues. First, a Saliency-Oriented Attention Pooling (SOAP) mechanism inside a lightweight TokenFusion module constructs a saliency-guided $\texttt{[FG]}$ token from patch features and fuses it with the global $\texttt{[CLS]}$ representation for coarse–fine alignment. Second, a two-headed LLM-derived classifier combines a frozen text head, used with multi-view CLIP features as a stable prior for pseudo-labels, with a learnable head initialized from the same LLM-derived descriptions. Finally, Dynamic Knowledge Aggregation convexly combines fixed CLIP/LLM priors with TokenFusion logits during self-training. Across 13 diverse classification benchmarks, including fine-grained ones, microCLIP yields a mean improvement of $\mathbf{+2.90}$ percentage points over the strongest prior UA baseline while fine-tuning only layer norms and a tiny classification head.
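The abstract's two core operations admit a compact illustration: pooling patch features under saliency weights to form an $\texttt{[FG]}$ token, and convexly mixing frozen prior logits with learned TokenFusion logits. The sketch below is a minimal NumPy rendering under assumed shapes; the function names, the scalar mixing weight `lam`, and the use of a plain softmax over saliency scores are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def saliency_pooled_fg(patch_tokens, saliency_scores):
    """Form an [FG]-style token: softmax-normalize per-patch saliency
    scores and take the weighted sum of patch features.
    patch_tokens: (num_patches, dim); saliency_scores: (num_patches,)."""
    w = softmax(saliency_scores)          # attention weights, sum to 1
    return w @ patch_tokens               # (dim,)

def aggregate_logits(prior_logits, fusion_logits, lam):
    """Convex combination of fixed CLIP/LLM prior logits with
    TokenFusion logits (lam in [0, 1]); mixing schedule is assumed."""
    assert 0.0 <= lam <= 1.0
    return lam * prior_logits + (1.0 - lam) * fusion_logits
```

Because the aggregation is convex, each combined logit stays within the interval spanned by the two source logits, which keeps the frozen prior acting as a stabilizer during self-training.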
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal pretraining, image text matching, cross-modal application, multimodality, self-supervised learning, transfer learning / domain adaptation, representation learning
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 1275