Supplementary Material: zip
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: representation learning, computer vision
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We explain how Masked Distillation (a variant of Masked Image Modeling) improves model performance.
Abstract: In self-supervised learning, Masked Image Modeling (MIM) is a viable approach for reducing the dependency on large-scale annotated data while demonstrating efficacy across a broad spectrum of downstream tasks. A recent variant of MIM, Masked Distillation (MD), uses semantic features rather than low-level features as the supervision signal. Although prior work has demonstrated its effectiveness on various downstream tasks, the mechanisms underlying its performance improvements remain unclear. Our investigation reveals that Masked Distillation mitigates multiple forms of overfitting present in the original models, including attention homogenization and representation folding in the higher layers. Furthermore, we show that Masked Distillation introduces beneficial inductive biases stemming from MIM, which we believe contribute positively to model performance. We also analyze architectural design choices and decision-making tendencies in Masked Distillation, revealing inconsistencies with previous research findings.
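To make the supervision scheme described above concrete, the following is a minimal sketch of a Masked Distillation training step, assuming a frozen teacher that supplies semantic patch features and a student trained to regress those features at masked positions. The module and function names (ToyEncoder, masked_distillation_loss) and all hyperparameters are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal Masked Distillation (MD) sketch: the student reconstructs the
# teacher's semantic features (not raw pixels) for randomly masked patches.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyEncoder(nn.Module):
    """Stand-in patch encoder: linear patch embedding plus a small Transformer."""

    def __init__(self, patch_dim=768, embed_dim=768, depth=2, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(patch_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patches):
        # patches: (batch, num_patches, patch_dim)
        return self.blocks(self.proj(patches))


def masked_distillation_loss(student, teacher, patches, mask_ratio=0.6):
    """Supervise the student with the teacher's features on masked patches only."""
    b, n, _ = patches.shape
    # Sample a random binary mask over patches (True = masked).
    mask = torch.rand(b, n, device=patches.device) < mask_ratio

    with torch.no_grad():
        targets = teacher(patches)  # semantic features serve as the targets

    # Zero out masked patches before feeding the student, as in MIM.
    student_in = patches.masked_fill(mask.unsqueeze(-1), 0.0)
    preds = student(student_in)

    # Compute the regression loss only at masked positions.
    return F.smooth_l1_loss(preds[mask], targets[mask])


if __name__ == "__main__":
    student, teacher = ToyEncoder(), ToyEncoder()
    teacher.eval()
    dummy_patches = torch.randn(2, 196, 768)  # e.g. 14x14 patches of a 224px image
    print(masked_distillation_loss(student, teacher, dummy_patches).item())
```

The key design point this sketch highlights is the choice of target: replacing pixel-level reconstruction with the teacher's semantic features is what distinguishes Masked Distillation from vanilla MIM.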
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1920