Abstract: Masked Image Modeling (MIM) has emerged as a promising approach for Self-Supervised Learning (SSL) of visual
representations. However, the out-of-the-box performance
of MIMs is typically inferior to that of competing approaches,
and most users cannot afford fine-tuning, which requires large
amounts of data, substantial GPU resources, and specialized
expertise. The practical use of MIM representations is therefore
limited. In this paper, we ask why MIMs perform poorly out of
the box: is it because MIM models produce weaker features, or
because those features are used suboptimally? Through detailed
analysis, we show
that attention in MIMs is spread almost uniformly over many
patches, leading to ineffective aggregation by the [cls] token. Based on this insight, we propose Selective Aggregation
to better capture the rich semantic information retained in
patch tokens, which significantly improves the out-of-the-box
performance of MIMs.
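The observation above can be illustrated with a small numerical sketch: when attention weights are nearly uniform, the [cls] token's output collapses toward a plain mean of the patch tokens, washing out patch-level distinctions. The "selective" pooling shown at the end is a hypothetical stand-in (top-k patches scored by token norm), not the paper's actual Selective Aggregation method.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim = 196, 32
patch_tokens = rng.normal(size=(num_patches, dim))

# Near-uniform attention, as observed in MIMs: softmax over
# almost-flat logits gives weights close to 1 / num_patches.
logits = rng.normal(scale=0.01, size=num_patches)
attn = np.exp(logits) / np.exp(logits).sum()

# The [cls] aggregation is then nearly indistinguishable from
# mean pooling over all patches.
cls_out = attn @ patch_tokens
mean_pool = patch_tokens.mean(axis=0)
print(np.linalg.norm(cls_out - mean_pool))  # tiny gap

# Illustrative selective pooling (assumed scoring rule, for
# intuition only): aggregate just the k most salient patches,
# scored here by token norm.
k = 16
scores = np.linalg.norm(patch_tokens, axis=1)
top_k = np.argsort(scores)[-k:]
selective_pool = patch_tokens[top_k].mean(axis=0)
```

Under near-uniform attention, every image yields roughly the same average of its patch tokens regardless of content, which is why restricting aggregation to informative patches can recover the discriminative signal the patch tokens still carry.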