Masked Cross-attention Adapters Enable the Characterization of Dense Features

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: image features, image backbones, ViT, instance segmentation
Abstract: Learning meaningful representations is a core topic of deep learning. Over the last decade, many strategies for learning image representations have been proposed, involving supervision, self-supervision, and various data sources. In most current work, evaluation focuses on classification tasks while neglecting dense prediction tasks, possibly because linear probing is more challenging in the latter case. Furthermore, dense prediction heads are often large and come with specific inductive biases that further distort performance measurement. In this work we propose masked cross-attention adapters (MAXA), a minimal adapter method capable of dense prediction independent of the size and resolution of the encoder output. This allows us to make dense predictions with a small number of additional parameters ($<0.3\%$) while enabling fast training on top of frozen backbones. Using this adapter, we run a comprehensive evaluation assessing instance awareness, local semantics, and spatial representation for a diverse set of backbones. We find that DINOv2 outperforms all other backbones tested, including those supervised with masks and language, across all three task categories. Code is available at https://to.be.released.
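The abstract does not spell out the adapter's implementation; the following is a minimal PyTorch sketch of what a masked cross-attention adapter over frozen backbone features might look like. All names and shapes here (MaskedCrossAttentionAdapter, num_queries, hidden_dim, the mask head, the optional attention mask) are illustrative assumptions for exposition, not the authors' released code.

```python
# Hypothetical sketch: learnable queries cross-attend to frozen ViT patch tokens,
# producing per-query class logits and dense mask logits with few trainable parameters.
from typing import Optional
import torch
import torch.nn as nn


class MaskedCrossAttentionAdapter(nn.Module):
    """Lightweight decoder on top of a frozen backbone (illustrative, not the paper's code)."""

    def __init__(self, feat_dim: int, num_queries: int = 100, hidden_dim: int = 256,
                 num_heads: int = 8, num_classes: int = 80):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim))
        self.proj = nn.Linear(feat_dim, hidden_dim)          # project backbone tokens
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.cls_head = nn.Linear(hidden_dim, num_classes)   # per-query class logits
        self.mask_head = nn.Linear(hidden_dim, hidden_dim)   # per-query mask embedding

    def forward(self, patch_tokens: torch.Tensor,
                attn_mask: Optional[torch.Tensor] = None):
        # patch_tokens: (B, N, feat_dim) from a frozen backbone; no backbone gradients needed.
        B = patch_tokens.shape[0]
        kv = self.proj(patch_tokens)                          # (B, N, hidden_dim)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)       # (B, Q, hidden_dim)
        # attn_mask (optional): (B * num_heads, Q, N) boolean mask restricting which
        # patch tokens each query may attend to -- the "masked" part of the adapter.
        q, _ = self.cross_attn(q, kv, kv, attn_mask=attn_mask)
        mask_emb = self.mask_head(q)                          # (B, Q, hidden_dim)
        # Dense masks via dot products between query embeddings and projected tokens.
        masks = torch.einsum("bqc,bnc->bqn", mask_emb, kv)    # (B, Q, N) logits
        return self.cls_head(q), masks


# Usage with a frozen backbone producing 14x14 patch tokens of dimension 768:
backbone_tokens = torch.randn(2, 14 * 14, 768)                # stand-in for ViT output
adapter = MaskedCrossAttentionAdapter(feat_dim=768)
class_logits, mask_logits = adapter(backbone_tokens)
print(class_logits.shape, mask_logits.shape)                   # (2, 100, 80) (2, 100, 196)
```

Because only the adapter is trained, the parameter count and wall-clock cost stay small regardless of the backbone's size, which is the property the abstract highlights.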
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9706