Dense Representation Learning for a Joint-Embedding Predictive Architecture

18 Sept 2023 (modified: 11 Feb 2024), Submitted to ICLR 2024
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Self-supervised Learning, Joint-Embedding Predictive Architecture, Masked Image Modeling
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: In this paper, we introduce Dense-JEPA, a novel dense representation learning framework with a Joint-Embedding Predictive Architecture.
Abstract: The joint-embedding predictive architecture (JEPA) has recently shown impressive results in extracting visual representations from unlabeled imagery under a masking strategy. However, we reveal that its weakness lies in an inadequate grasp of local semantics for dense representations, a shortfall stemming from its masked modeling in the embedding space, which yields less discriminative or even missing local semantics. To bridge this gap, we introduce Dense-JEPA, a novel masked modeling objective rooted in JEPA and tailored for enhanced dense representation learning. Our key idea is simple: we consider a set of semantically similar neighboring patches as the target of a masked patch. Specifically, the proposed Dense-JEPA (a) computes feature similarities between each masked patch and its neighboring patches to select patches with semantically meaningful relations, and (b) employs lightweight cross-attention heads to aggregate the features of the selected neighboring patches as the masked targets. Consequently, Dense-JEPA learns better dense representations, which benefit a wide range of downstream tasks. Through extensive experiments, we demonstrate its effectiveness across various visual benchmarks, including ImageNet-1K image classification, ADE20K semantic segmentation, and COCO object detection.
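
The two-step target construction described in the abstract could be sketched roughly as follows. This is a minimal illustration based only on the abstract's description, not the authors' implementation; the module name DenseTargetAggregator, the top_k neighbor count, and all tensor shapes and defaults are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseTargetAggregator(nn.Module):
    """Sketch of the two steps named in the abstract: (a) select semantically
    similar neighbors of each masked patch via feature similarity, and
    (b) aggregate the selected neighbors with a lightweight cross-attention
    head to form the masked target. Shapes and defaults are assumptions."""

    def __init__(self, dim: int = 768, num_heads: int = 4, top_k: int = 8):
        super().__init__()
        self.top_k = top_k
        # Lightweight cross-attention: the masked-patch feature is the query,
        # the selected neighbor features are the keys and values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_feats: torch.Tensor, masked_idx: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, N, D) target-encoder features for all N patches
        # masked_idx:  (B, M) long indices of the M masked patches
        B, N, D = patch_feats.shape
        M = masked_idx.shape[1]

        # Features of the masked patches, gathered by index: (B, M, D)
        masked_feats = torch.gather(
            patch_feats, 1, masked_idx.unsqueeze(-1).expand(-1, -1, D))

        # (a) Cosine similarity between each masked patch and every patch,
        # keeping the top-k most similar patches as semantic neighbors.
        sim = torch.einsum(
            "bmd,bnd->bmn",
            F.normalize(masked_feats, dim=-1),
            F.normalize(patch_feats, dim=-1))                      # (B, M, N)
        nbr_idx = sim.topk(self.top_k, dim=-1).indices             # (B, M, k)

        # Gather the selected neighbor features: (B, M, k, D)
        nbr_feats = torch.gather(
            patch_feats.unsqueeze(1).expand(-1, M, -1, -1), 2,
            nbr_idx.unsqueeze(-1).expand(-1, -1, -1, D))

        # (b) Cross-attention aggregation: one query per masked patch.
        q = masked_feats.reshape(B * M, 1, D)
        kv = nbr_feats.reshape(B * M, self.top_k, D)
        target, _ = self.cross_attn(q, kv, kv)                     # (B*M, 1, D)
        return target.reshape(B, M, D)                             # masked targets
```

Note that in this sketch each masked patch is trivially its own top-1 neighbor; excluding it, and details such as stop-gradient on the target encoder, are omitted for brevity.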
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1032