Harmony: A Joint Self-Supervised and Weakly-Supervised Framework for Learning General Purpose Visual Representations

TMLR Paper 4548 Authors

24 Mar 2025 (modified: 24 Apr 2025) · Under review for TMLR · CC BY 4.0
Abstract: Vision-language contrastive learning frameworks such as CLIP learn representations from natural language supervision and provide strong zero-shot classification capabilities. However, because of the nature of their supervisory signal, they struggle to learn localized features, which degrades performance on dense prediction tasks such as segmentation and detection. Self-supervised learning methods, on the other hand, can learn granular representations that complement the high-level features produced by vision-language training. In this work, we present Harmony, a framework that combines vision-language training with discriminative and generative self-supervision to learn visual features that generalize across downstream vision tasks. Our framework is specifically designed for web-scraped data: it does not rely on negative examples in the self-supervised learning path, and it addresses the one-to-one correspondence issue using soft CLIP targets generated by an EMA model. Moreover, Harmony optimizes five objectives simultaneously, efficiently exploiting the supervision in each data example and making it especially well suited to data-constrained settings. We comprehensively evaluate Harmony across vision downstream tasks and find that it significantly outperforms the baseline CLIP as well as the previously leading joint self- and weakly-supervised methods SLIP, MaskCLIP, and DetailCLIP. Specifically, when pre-training a ViT-B on CC3M, Harmony surpasses these methods on linear-probing, fine-tuning, and zero-shot classification on ImageNet-1k, semantic segmentation on ADE20K, and both object detection and instance segmentation on MS-COCO. We also show that Harmony outperforms SILC on detection and on linear and fine-tuning classification, and outperforms purely self-supervised methods such as iBOT and MAE across all evaluated tasks.
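The abstract's central mechanism, replacing CLIP's hard one-to-one targets with soft targets from an EMA teacher, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names (`ema_update`, `soft_clip_loss`), the momentum 0.999, and the temperature 0.07 are placeholders, not the paper's actual implementation.

```python
# Minimal sketch of soft-target CLIP with an EMA teacher, as described in the
# abstract. All names and hyperparameters here are illustrative assumptions.

import torch
import torch.nn.functional as F


@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               momentum: float = 0.999):
    """Update teacher weights as an exponential moving average of the student."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param.detach(), alpha=1.0 - momentum)


def soft_clip_loss(img_emb, txt_emb, t_img_emb, t_txt_emb,
                   temperature: float = 0.07):
    """Contrastive loss with soft targets produced by the EMA teacher.

    Instead of assuming each image matches exactly one caption (hard identity
    targets), the teacher's image-text similarities define a soft target
    distribution, which tolerates the noisy pairings of web-scraped data.
    """
    # Student logits: cosine similarities scaled by temperature.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature

    # Teacher soft targets; no gradients flow through the teacher.
    with torch.no_grad():
        t_img = F.normalize(t_img_emb, dim=-1)
        t_txt = F.normalize(t_txt_emb, dim=-1)
        targets = F.softmax(t_img @ t_txt.t() / temperature, dim=-1)

    loss_i2t = F.cross_entropy(logits, targets)          # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets.t())  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)
```

Under the same assumptions, the five objectives would be combined as a weighted sum of this soft contrastive loss with the discriminative and generative self-supervised losses, with `ema_update` called after each optimizer step to refresh the teacher.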
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Stephen_Lin1
Submission Number: 4548
