PARTICLE: Part Discovery and Contrastive Learning for Fine-grained Recognition

Oindrila Saha; Subhransu Maji

Published: 31 Jul 2023, Last Modified: 31 Jul 2023

Venue: VIPriors 2023 OralPosterTBD

Keywords: Self-supervised learning, Part segmentation, Few-shot learning, Fine-grained classification

TL;DR: We propose an approach for fine-tuning models on fine-grained domains by jointly discovering parts and contrastive learning for classification and segmentation tasks.

Abstract: We develop techniques for refining representations for fine-grained classification and segmentation tasks in a self-supervised manner. Current fine-tuning methods based on instance-discriminative contrastive learning are less effective in this setting, possibly because object pose and background are highly discriminative for instances but act as nuisance factors for categorization. We present an iterative learning approach that incorporates part-centric equivariance and invariance objectives. First, pixel representations are clustered in a part discovery step, for which we analyze which representations from convolutional and vision transformer networks are best suited. Then, a part-centric learning step aggregates and contrasts representations of parts within an image. We show that this improves downstream performance on image classification and part segmentation tasks across datasets. For example, under a linear-evaluation scheme, the classification accuracy of a ResNet50 trained on ImageNet with the self-supervised method DetCon improves from 35.4% to 42.0% on the Caltech-UCSD Birds dataset, from 35.5% to 44.1% on the FGVC Aircraft dataset, and from 29.7% to 37.4% on the Stanford Cars dataset. We also observe significant gains in few-shot part segmentation tasks on these datasets, while in both cases instance-discriminative learning was not as effective. Smaller, yet consistent, improvements are also observed for stronger baseline models based on vision transformers. We present experiments that evaluate the significance of pre-trained networks and of part-discovery techniques for downstream tasks.

Supplementary Material: zip

Submission Number: 11
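
The two steps named in the abstract (clustering pixel representations into parts, then aggregating and contrasting part representations) can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names, the use of plain k-means, the InfoNCE-style loss, the temperature value, and the random stand-in features are all assumptions made for illustration only.

```python
import numpy as np


def kmeans_parts(pixel_feats, num_parts, iters=10):
    """Assumed part discovery: plain k-means over pixel features of shape (N, D)."""
    # Deterministic init: evenly spaced pixels serve as the initial centroids.
    idx = np.linspace(0, len(pixel_feats) - 1, num_parts).astype(int)
    centroids = pixel_feats[idx].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every pixel to every centroid: (N, K).
        dists = ((pixel_feats[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for k in range(num_parts):
            members = pixel_feats[labels == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return labels


def pool_parts(pixel_feats, labels, num_parts):
    """Aggregate: average the features inside each part, then L2-normalise (K, D)."""
    pooled = np.zeros((num_parts, pixel_feats.shape[1]))
    for k in range(num_parts):
        mask = labels == k
        if mask.any():
            pooled[k] = pixel_feats[mask].mean(axis=0)
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.maximum(norms, 1e-8)


def part_contrastive_loss(parts_a, parts_b, temperature=0.1):
    """Assumed InfoNCE-style loss: part k in view A is the positive for part k in view B."""
    logits = parts_a @ parts_b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))


# Random stand-in for an 8x8 feature map with 8 channels (hypothetical data).
rng = np.random.default_rng(0)
feats = rng.normal(size=(64, 8))
labels = kmeans_parts(feats, num_parts=4)

# Two lightly perturbed copies stand in for two augmented views of one image.
view_a = feats + 0.01 * rng.normal(size=feats.shape)
view_b = feats + 0.01 * rng.normal(size=feats.shape)
loss = part_contrastive_loss(pool_parts(view_a, labels, 4),
                             pool_parts(view_b, labels, 4))
```

In a real pipeline the features would come from a pre-trained backbone, the loss would be backpropagated to update that backbone, and the two steps would alternate; the sketch only shows how the pieces fit together under the assumptions above.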
