Understanding Self-Supervised Pretraining with Part-Aware Representation Learning
Abstract: In this paper, we aim to understand self-supervised pretraining by studying the capability of self-supervised methods to learn part-aware representations. The study is mainly motivated by the observation that the random views used in contrastive learning and the randomly masked (visible) patches used in masked image modeling are often about object parts. We explain that contrastive learning is a part-to-whole task: the projection layer hallucinates the whole-object representation from the object-part representation produced by the encoder; and that masked image modeling is a part-to-part task: the masked patches of the object are hallucinated from the visible patches. This explanation suggests that a self-supervised pretrained encoder leans toward understanding object parts. We empirically compare off-the-shelf encoders pretrained with several representative methods on object-level and part-level recognition. The results show that the fully-supervised model outperforms self-supervised models on object-level recognition, while most self-supervised contrastive learning and masked image modeling methods outperform the fully-supervised method on part-level recognition. We also observe that combining contrastive learning and masked image modeling further improves performance.
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We incorporate the content of the rebuttal into this submission, including:
- Added the object-level retrieval experiment
- Ablated with different learning rates
- Conducted a preliminary experiment on non-object-centric datasets
- Added more discussion of related work
- Explained the idea behind our part resizing approach
- Added more details to clarify points of confusion
- Fixed the presentation issues
Assigned Action Editor: ~Simon_Kornblith1
Submission Number: 972