microCLIP: Unsupervised CLIP Adaptation via Coarse-Fine Token Fusion for Fine-Grained Image Classification
Keywords: Vision-language Models, Unsupervised Adaptation, Fine-grained Visual Recognition
Abstract: Unsupervised adaptation (UA) of CLIP-based vision-language models (VLMs) for fine-grained image classification is challenging because pseudo-labels must be inferred from tiny local cues. CLIP’s contrastive pretraining yields strong zero-shot transfer, but its coarse-grained $\texttt{[CLS]}$ visual token often misses fine spatial details. Existing UA methods mainly enrich text prompts with large language model (LLM) descriptions while still relying on $\texttt{[CLS]}$, which limits spatial precision. We propose $\textbf{microCLIP}$, a label-free self-training framework that jointly adapts visual features and LLM-derived text prototypes using fine-grained cues. First, a Saliency-Oriented Attention Pooling (SOAP) mechanism inside a lightweight TokenFusion module constructs a saliency-guided $\texttt{[FG]}$ token from patch features and fuses it with the global $\texttt{[CLS]}$ representation for coarse–fine alignment. Second, a two-headed LLM-derived classifier combines a frozen text head, used with multi-view CLIP features as a stable prior for pseudo-labels, with a learnable head initialized from the same LLM-derived descriptions. Finally, Dynamic Knowledge Aggregation convexly combines fixed CLIP/LLM priors with TokenFusion logits during self-training. Across 13 diverse classification benchmarks, including fine-grained ones, microCLIP yields a mean improvement of $\mathbf{+2.90}$ percentage points over the strongest prior UA baseline while fine-tuning only layer norms and a tiny classification head.
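The abstract's two core operations admit a compact illustration: pooling patch features under saliency weights to form an $\texttt{[FG]}$ token, and convexly mixing frozen prior logits with learned TokenFusion logits. The sketch below is a minimal NumPy rendering under assumed shapes; the function names, the scalar mixing weight `lam`, and the use of a plain softmax over saliency scores are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def saliency_pooled_fg(patch_tokens, saliency_scores):
    """Form an [FG]-style token: softmax-normalize per-patch saliency
    scores and take the weighted sum of patch features.
    patch_tokens: (num_patches, dim); saliency_scores: (num_patches,)."""
    w = softmax(saliency_scores)          # attention weights, sum to 1
    return w @ patch_tokens               # (dim,)

def aggregate_logits(prior_logits, fusion_logits, lam):
    """Convex combination of fixed CLIP/LLM prior logits with
    TokenFusion logits (lam in [0, 1]); mixing schedule is assumed."""
    assert 0.0 <= lam <= 1.0
    return lam * prior_logits + (1.0 - lam) * fusion_logits
```

Because the aggregation is convex, each combined logit stays within the interval spanned by the two source logits, which keeps the frozen prior acting as a stabilizer during self-training.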
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal pretraining, image text matching, cross-modal application, multimodality, self-supervised learning, transfer learning / domain adaptation, representation learning
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 1275