Teacher-generated spatial-attention labels boost robustness and accuracy of contrastive models

Published: 01 Jan 2023 · Last Modified: 16 May 2025 · CVPR 2023 · CC BY-SA 4.0
Abstract: Human spatial attention conveys information about the regions of a visual scene that are important for performing visual tasks. Prior work has shown that information about human attention can be leveraged to benefit various supervised vision tasks. Might providing this weak form of supervision also be useful for self-supervised representation learning? Addressing this question requires large datasets with human attention labels, yet collecting such data at scale is very expensive. To address this challenge, we construct an auxiliary teacher model that predicts human attention, trained on a relatively small labeled dataset. This teacher model allows us to generate (pseudo) attention labels for ImageNet images. We then train a model with a primary contrastive objective; to this standard configuration, we add a simple output head trained to predict the attention map for each image, guided by the pseudo labels from the teacher model. We measure the quality of the learned representations by evaluating classification performance from the frozen learned embeddings as well as performance on image retrieval tasks (see supplementary material). We find that the spatial-attention maps predicted by the contrastive model trained with teacher guidance align better with human attention than those of vanilla contrastive models. Moreover, our approach improves the classification accuracy and robustness of contrastive models on ImageNet and ImageNet-C. Further, the model representations become more useful for image retrieval tasks, as measured by precision-recall performance on ImageNet, ImageNet-C, CIFAR10, and CIFAR10-C.
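The abstract describes a training setup that combines a standard contrastive objective with an auxiliary head regressing teacher-generated attention maps. Below is a minimal sketch of such a combined objective, not the authors' code: the class and argument names (AttnContrastiveModel, attn_weight, map_size), the SimCLR-style NT-Xent loss, the MSE attention loss, and the assumption that one pseudo map supervises both augmented views are all illustrative choices, not details confirmed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision


class AttnContrastiveModel(nn.Module):
    """ResNet-50 backbone with a contrastive projection head and an
    auxiliary head that predicts a coarse spatial-attention map."""

    def __init__(self, proj_dim=128, map_size=7):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        backbone.fc = nn.Identity()            # keep the 2048-d pooled features
        self.backbone = backbone
        self.projector = nn.Sequential(        # contrastive projection head
            nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, proj_dim))
        self.attn_head = nn.Sequential(        # predicts a map_size x map_size attention map
            nn.Linear(2048, map_size * map_size), nn.Sigmoid())
        self.map_size = map_size

    def forward(self, x):
        h = self.backbone(x)                                   # (B, 2048)
        z = F.normalize(self.projector(h), dim=1)              # embedding for contrastive loss
        a = self.attn_head(h).view(-1, self.map_size, self.map_size)
        return z, a


def nt_xent(z1, z2, temperature=0.1):
    """SimCLR-style NT-Xent loss over two augmented views."""
    z = torch.cat([z1, z2], dim=0)              # (2B, D)
    sim = z @ z.t() / temperature
    n = z.size(0)
    sim.fill_diagonal_(float("-inf"))           # mask self-similarity
    targets = (torch.arange(n, device=z.device) + n // 2) % n  # positive = other view
    return F.cross_entropy(sim, targets)


def training_step(model, view1, view2, teacher_maps, attn_weight=1.0):
    """teacher_maps: (B, map_size, map_size) pseudo attention labels from the teacher.
    Simplification: the same pseudo map supervises both augmented views."""
    z1, a1 = model(view1)
    z2, a2 = model(view2)
    loss_con = nt_xent(z1, z2)
    loss_attn = F.mse_loss(a1, teacher_maps) + F.mse_loss(a2, teacher_maps)
    return loss_con + attn_weight * loss_attn


if __name__ == "__main__":
    model = AttnContrastiveModel()
    v1, v2 = torch.randn(4, 3, 224, 224), torch.randn(4, 3, 224, 224)
    pseudo = torch.rand(4, 7, 7)                # stand-in for teacher pseudo labels
    loss = training_step(model, v1, v2, pseudo)
    loss.backward()
    print(float(loss))
```

Freezing the backbone after training and fitting a linear classifier on its pooled features would correspond to the frozen-embedding evaluation mentioned in the abstract.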