Keywords: Self-Supervised Learning, Fine-Grained Learning
Abstract: Self-supervised learning (SSL) has achieved strong results on coarse-grained tasks
but often struggles with fine-grained recognition, where categories differ only by
subtle local cues. For strong downstream transfer, features must form compact
within-class clusters with large inter-class margins at the fine level. However,
standard SSL losses either over-separate visually similar subcategories by treating
all non-positives as equally negative, or overlook part-based evidence and thus
merge them under coarse prototypes. We propose a multi-level regularization
framework that improves clustering across granularities. At the global level, a soft
variant of InfoNCE reduces false negatives and enhances class separation. At the
part level, clustering on local descriptors preserves subtle intra-class distinctions.
At the instance level, semantic descriptions from vision–language models provide
attribute-level anchors. Together, these components yield representations with
balanced clustering across granularities. Experiments on CUB-200-2011, Stanford
Cars, and FGVC-Aircraft show consistent improvements in both classification and
retrieval, validating the effectiveness of our approach for fine-grained SSL.
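To make the global-level component concrete, here is a minimal sketch of one way a "soft" InfoNCE variant can reduce the false-negative problem the abstract describes. The abstract does not specify the exact formulation, so the function name `soft_info_nce`, the `smoothing` parameter, and the similarity-weighted target distribution are illustrative assumptions, not the authors' actual loss.

```python
# Hypothetical soft-InfoNCE sketch; NOT the paper's exact loss.
import torch
import torch.nn.functional as F

def soft_info_nce(z1, z2, temperature=0.1, smoothing=0.1):
    """Contrastive loss over two augmented views (N, D each).

    Standard InfoNCE uses a hard one-hot target, treating every
    non-positive as an equally strong negative. This sketch instead
    spreads `smoothing` mass over non-positives in proportion to their
    similarity, so visually similar subcategories are repelled less.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature            # (N, N) similarities
    n = z1.size(0)

    with torch.no_grad():
        # Similarity-weighted distribution over negatives only.
        neg_sim = logits.clone()
        neg_sim.fill_diagonal_(float("-inf"))     # exclude the positive
        neg_weights = F.softmax(neg_sim, dim=1)
        # (1 - smoothing) on the true positive, rest on soft negatives.
        targets = (1.0 - smoothing) * torch.eye(n, device=z1.device)
        targets = targets + smoothing * neg_weights

    log_probs = F.log_softmax(logits, dim=1)
    return -(targets * log_probs).sum(dim=1).mean()

# Usage: embeddings of two views of the same batch.
z1, z2 = torch.randn(256, 128), torch.randn(256, 128)
loss = soft_info_nce(z1, z2)
```

With `smoothing = 0` this reduces to standard InfoNCE; larger values soften the repulsion of the most similar non-positives, which is the behavior the abstract attributes to the global-level regularizer.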
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 9223