Abstract: The paradigm of pretraining a backbone on a large set of (often unlabeled) images has gained popularity. The quality of the resulting features is commonly measured by freezing the backbone and training different task heads on top of it. However, current evaluations either cover only whole-image classification or require complex dense task heads that introduce a large number of parameters and add their own inductive biases. In this work, we propose dense attentive probing, a parameter-efficient readout that makes dense predictions with arbitrary backbones, independent of the size and resolution of their feature volume. To this end, we use a masked cross-attention layer with learnable mask sizes, which enables dense prediction with a small parameter budget and thus provides relatively unbiased access to the features. We employ this method to evaluate common backbones along three dimensions: instance awareness, local semantics, and spatial understanding. We find that DINOv2 outperforms all other backbones tested -- including those supervised with masks and language -- across all three task categories. Furthermore, our analysis suggests that self-supervised training tends to yield features that separate object instances better than vision-language models. Code is available at https://to.be.released.
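The abstract does not spell out implementation details, but the core idea (per-pixel queries cross-attending to frozen backbone tokens under a learnable local attention mask) can be sketched as follows. This is a minimal PyTorch illustration, not the authors' code: the class name, the single shared Gaussian mask bandwidth, the query parameterization, and the assumption of a square token grid are all our own simplifications.

```python
import torch
import torch.nn as nn

class DenseAttentiveProbe(nn.Module):
    """Hypothetical sketch of a masked cross-attention readout:
    one learnable query per output pixel attends to the frozen
    backbone's feature tokens, with a learnable Gaussian window
    playing the role of the 'learnable mask size'."""

    def __init__(self, feat_dim, out_channels, out_size, embed_dim=256):
        super().__init__()
        self.out_size = out_size                      # output resolution (H == W)
        n_queries = out_size * out_size
        self.queries = nn.Parameter(torch.randn(n_queries, embed_dim) * 0.02)
        self.kv_proj = nn.Linear(feat_dim, 2 * embed_dim)
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.head = nn.Linear(embed_dim, out_channels)
        # One learnable (log) bandwidth shared by all queries; the paper's
        # actual mask parameterization may be finer-grained.
        self.log_sigma = nn.Parameter(torch.zeros(1))

    def forward(self, feats):
        # feats: (B, N, feat_dim) tokens from a frozen backbone,
        # assumed here to form a sqrt(N) x sqrt(N) spatial grid.
        B, N, _ = feats.shape
        g = int(N ** 0.5)
        k, v = self.kv_proj(feats).chunk(2, dim=-1)           # (B, N, D) each
        q = self.q_proj(self.queries).expand(B, -1, -1)       # (B, Q, D)

        # Normalized 2D positions of output queries and feature tokens.
        qy, qx = torch.meshgrid(
            torch.linspace(0, 1, self.out_size),
            torch.linspace(0, 1, self.out_size), indexing="ij")
        ky, kx = torch.meshgrid(
            torch.linspace(0, 1, g),
            torch.linspace(0, 1, g), indexing="ij")
        qpos = torch.stack([qy.flatten(), qx.flatten()], -1).to(feats.device)
        kpos = torch.stack([ky.flatten(), kx.flatten()], -1).to(feats.device)

        # Soft spatial mask: distant tokens get a negative attention bias,
        # with the window size controlled by the learned sigma.
        dist2 = ((qpos[:, None] - kpos[None]) ** 2).sum(-1)   # (Q, N)
        sigma = self.log_sigma.exp()
        bias = -dist2 / (2 * sigma ** 2)

        attn = torch.einsum("bqd,bnd->bqn", q, k) / q.shape[-1] ** 0.5
        attn = (attn + bias).softmax(-1)
        out = torch.einsum("bqn,bnd->bqd", attn, v)           # (B, Q, D)
        out = self.head(out)                                  # (B, Q, C)
        return out.transpose(1, 2).reshape(B, -1, self.out_size, self.out_size)
```

Restricting the probe to a single cross-attention layer plus a linear head keeps the parameter budget small, so, in the spirit of the abstract, performance differences should mostly reflect the frozen features rather than the capacity of the head.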
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Yanwei_Fu2
Submission Number: 4586