Abstract: The paradigm of pretraining a backbone on a large set of (often unlabeled) images has gained popularity. The quality of the resulting features is commonly measured by freezing the backbone and training different task heads on top of it. However, current evaluations either cover only whole-image classification or require complex dense task heads that introduce a large number of parameters and add their own inductive biases. In this work, we propose dense attentive probing, a parameter-efficient readout that makes dense predictions with arbitrary backbones, independent of the size and resolution of their feature volume. To this end, we use a masked cross-attention layer with learnable mask sizes, which enables dense prediction with a small parameter budget and thus provides relatively unbiased access to the features. We employ this method to evaluate common backbones along three dimensions: instance awareness, local semantics, and spatial understanding. We find that DINOv2 outperforms all other backbones tested -- including those supervised with masks and language -- across all three task categories. Furthermore, our analysis suggests that self-supervised training tends to yield features that separate object instances better than vision-language models. Code is available at https://to.be.released.
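The abstract does not spell out implementation details, but the core idea (per-pixel queries cross-attending to frozen backbone tokens under a learnable local attention mask) can be sketched as follows. This is a minimal PyTorch illustration, not the authors' code: the class name, the single shared Gaussian mask bandwidth, the query parameterization, and the assumption of a square token grid are all our own simplifications.

```python
import torch
import torch.nn as nn

class DenseAttentiveProbe(nn.Module):
    """Hypothetical sketch of a masked cross-attention readout:
    one learnable query per output pixel attends to the frozen
    backbone's feature tokens, with a learnable Gaussian window
    playing the role of the 'learnable mask size'."""

    def __init__(self, feat_dim, out_channels, out_size, embed_dim=256):
        super().__init__()
        self.out_size = out_size                      # output resolution (H == W)
        n_queries = out_size * out_size
        self.queries = nn.Parameter(torch.randn(n_queries, embed_dim) * 0.02)
        self.kv_proj = nn.Linear(feat_dim, 2 * embed_dim)
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.head = nn.Linear(embed_dim, out_channels)
        # One learnable (log) bandwidth shared by all queries; the paper's
        # actual mask parameterization may be finer-grained.
        self.log_sigma = nn.Parameter(torch.zeros(1))

    def forward(self, feats):
        # feats: (B, N, feat_dim) tokens from a frozen backbone,
        # assumed here to form a sqrt(N) x sqrt(N) spatial grid.
        B, N, _ = feats.shape
        g = int(N ** 0.5)
        k, v = self.kv_proj(feats).chunk(2, dim=-1)           # (B, N, D) each
        q = self.q_proj(self.queries).expand(B, -1, -1)       # (B, Q, D)

        # Normalized 2D positions of output queries and feature tokens.
        qy, qx = torch.meshgrid(
            torch.linspace(0, 1, self.out_size),
            torch.linspace(0, 1, self.out_size), indexing="ij")
        ky, kx = torch.meshgrid(
            torch.linspace(0, 1, g),
            torch.linspace(0, 1, g), indexing="ij")
        qpos = torch.stack([qy.flatten(), qx.flatten()], -1).to(feats.device)
        kpos = torch.stack([ky.flatten(), kx.flatten()], -1).to(feats.device)

        # Soft spatial mask: distant tokens get a negative attention bias,
        # with the window size controlled by the learned sigma.
        dist2 = ((qpos[:, None] - kpos[None]) ** 2).sum(-1)   # (Q, N)
        sigma = self.log_sigma.exp()
        bias = -dist2 / (2 * sigma ** 2)

        attn = torch.einsum("bqd,bnd->bqn", q, k) / q.shape[-1] ** 0.5
        attn = (attn + bias).softmax(-1)
        out = torch.einsum("bqn,bnd->bqd", attn, v)           # (B, Q, D)
        out = self.head(out)                                  # (B, Q, C)
        return out.transpose(1, 2).reshape(B, -1, self.out_size, self.out_size)
```

Restricting the probe to a single cross-attention layer plus a linear head keeps the parameter budget small, so, in the spirit of the abstract, performance differences should mostly reflect the frozen features rather than the capacity of the head.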
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Yanwei_Fu2
Submission Number: 4586