Abstract: We present a novel self-supervised feature learning method using Vision Transformers (ViT) as the backbone, specifically designed for object detection and instance segmentation. Our approach addresses the challenge of extracting features that capture both class and positional information, which are crucial for these tasks. The method introduces two key components: (1) a positional encoding tied to the cropping process in contrastive learning, which utilizes a novel vector field representation for positional embeddings; and (2) masking and prediction, similar to conventional Masked Image Modeling (MIM), applied in parallel to both the content and positional embeddings of image patches. These components enable the effective learning of intertwined content and positional features. We evaluate our method against state-of-the-art approaches, pre-training on ImageNet-1K and fine-tuning on downstream tasks. Our method outperforms state-of-the-art SSL methods on the COCO object detection benchmark, achieving significant improvements with fewer pre-training epochs. These results suggest that better integration of positional information into self-supervised learning can improve performance on dense prediction tasks.
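As a minimal illustration of the two components, the sketch below shows how one might (1) derive a per-patch positional field from the crop geometry, expressed in the original image's coordinate frame, and (2) mask content and positional embeddings in parallel for MIM-style prediction. This is a hypothetical reading of the abstract, not the authors' implementation: the function names, the 2-vector patch-center representation of the vector field, and the mask ratio are all assumptions.

```python
# Hypothetical sketch of the two components described in the abstract.
# Not the authors' code; names and design choices are our own assumptions.

import torch

def crop_position_field(crop_box, grid_size):
    """Return a (grid_size*grid_size, 2) field of patch-center coordinates,
    expressed in the ORIGINAL image frame (normalized to [0, 1]), so that
    two crops of the same image share a consistent positional reference."""
    x0, y0, x1, y1 = crop_box  # crop corners, normalized to the original image
    # Left/top edges of each patch cell, shifted by half a cell to get centers.
    ys = torch.linspace(y0, y1, grid_size + 1)[:-1] + (y1 - y0) / (2 * grid_size)
    xs = torch.linspace(x0, x1, grid_size + 1)[:-1] + (x1 - x0) / (2 * grid_size)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    return torch.stack([gx, gy], dim=-1).reshape(-1, 2)  # one 2-vector per patch

def mask_in_parallel(content_emb, pos_field, mask_ratio=0.5):
    """Independently mask content tokens and positional tokens (MIM-style);
    the model is then trained to predict the hidden entries of each stream."""
    n = content_emb.shape[0]
    content_mask = torch.rand(n) < mask_ratio  # True = hidden, to be predicted
    pos_mask = torch.rand(n) < mask_ratio      # masked separately from content
    return content_mask, pos_mask

# Usage: a 14x14 ViT patch grid on a crop spanning [0.2, 0.7] x [0.3, 0.9]
field = crop_position_field((0.2, 0.3, 0.7, 0.9), grid_size=14)
c_mask, p_mask = mask_in_parallel(torch.randn(196, 768), field)
```

Expressing patch positions in the original image's frame rather than within each crop is what would let two crops of the same image agree on where a patch sits, which is presumably why the abstract ties the positional encoding to the cropping process.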
Lay Summary: Before tackling real‑world tasks like self‑driving or reading medical scans, vision systems take a label‑free warm‑up on millions of unlabeled pictures. Typical label‑free warm‑ups teach them only what objects look like and leave where they appear to later modules. Our idea is to blend both what and where during this warm‑up, linking them early without any human annotations.
During this warm‑up, we let the vision system play a game of hide-and-seek. We cut random windows from each photo, shuffling and scaling them. Next, we hide some pieces or their positions and challenge the system to figure out what’s missing and where it belongs. This challenge weaves content and position tightly together inside the vision system.
After a short tuning step with labels, the system draws tighter boxes and cleaner masks than previous label‑free warm‑up methods. Because our recipe alters only the learning process, not the system design, it works as a plug‑and‑play upgrade that delivers accurate results with shorter training.
Link To Code: https://github.com/KJ-rc/IntertwinedSSL
Primary Area: Deep Learning->Self-Supervised Learning
Keywords: Self-supervised learning; Instance segmentation pre-training; Object detection pre-training; Vision transformer
Submission Number: 1974