Keywords: object-centric learning, vision pretraining, weakly-supervised learning
TL;DR: We propose a weakly-supervised pretraining approach for vision foundation models that shifts the self-distillation granularity from whole images to individual objects.
Abstract: Self-distillation has become a central paradigm for pretraining vision transformers (ViTs). Existing approaches typically operate at the image level and assume that different augmentations of the same image preserve the semantic content to be distilled. This premise breaks down in complex scenes containing multiple objects, where randomly sampled augmentations can depict different objects. To address this, we introduce ODIS (Object-level Self-Distillation), a framework that refines the self-distillation objective to the level of individual objects using bounding boxes that encapsulate them. ODIS leverages object-aware cropping to ensure that teacher and student views depict the same object, and employs masked attention to focus the learning signal on that object. Applied to ImageNet-1K, ODIS outperforms image-level distillation methods such as iBOT on both image-level and patch-level benchmarks, and its features transfer better to downstream classification and retrieval tasks. Moreover, ODIS is robust to bounding-box noise: with two different off-the-shelf box extractors, it consistently improves over state-of-the-art baselines. Our results highlight the importance of object-centric supervision in scalable representation learning and demonstrate how pretrained tools can be integrated into distillation pipelines to enhance generalization.
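A minimal sketch of the two mechanisms the abstract names, object-aware cropping and box-derived attention masking. The function names (`object_aware_crop`, `box_to_patch_mask`) and all parameter choices here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: function names, crop scales, and the ViT patch
# layout are assumptions, not the ODIS reference code.
import random

import torch
import torchvision.transforms.functional as TF


def object_aware_crop(image, box, scale=(0.3, 1.0), out_size=224):
    """Sample a random crop constrained to lie inside an object's bounding
    box, so that teacher and student views depict the same object."""
    x1, y1, x2, y2 = box
    bw, bh = x2 - x1, y2 - y1
    s = random.uniform(*scale)                    # fraction of the box to keep
    cw, ch = max(1, int(bw * s)), max(1, int(bh * s))
    cx = random.randint(x1, max(x1, x2 - cw))     # crop stays inside the box
    cy = random.randint(y1, max(y1, y2 - ch))
    crop = TF.crop(image, top=cy, left=cx, height=ch, width=cw)
    return TF.resize(crop, [out_size, out_size], antialias=True)


def box_to_patch_mask(box, img_size=224, patch=16):
    """Mark the ViT patches that overlap a (crop-relative) bounding box.
    Attention to patches outside the object can then be masked out, e.g.
    by turning this into an additive -inf bias for scaled dot-product
    attention."""
    n = img_size // patch
    mask = torch.zeros(n, n, dtype=torch.bool)
    x1, y1, x2, y2 = (int(c) for c in box)
    r1, r2 = y1 // patch, min(n, -(-y2 // patch))  # ceil division for the far edge
    c1, c2 = x1 // patch, min(n, -(-x2 // patch))
    mask[r1:r2, c1:c2] = True
    return mask.flatten()                          # (n*n,), True = object patch


# Usage sketch: restrict attention to object patches via an additive bias
# (the bias would be passed as attn_mask to attention inside the backbone).
patch_mask = box_to_patch_mask(box=(32, 48, 180, 200))
attn_bias = torch.where(patch_mask, 0.0, float("-inf"))
```

Masking at the attention level rather than blacking out pixels keeps the full image in context while concentrating the distillation loss on the object's tokens; this is one plausible reading of the "masked attention" described above.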
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 25025