A Clustering Baseline for Object-Centric Representations

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: object-centric representations, self-supervised learning
TL;DR: Clustering DINOv2 features with k-means produces better object-centric representations than state-of-the-art slot-based models, e.g., for scene and video classification. Object masks are more general (they can overlap) and capture both objects and parts, but are less pixel-perfect.
Abstract: Object-centric learning aims to discover and represent visual entities as a small set of object embeddings and masks, which can later be used for downstream tasks. Recent methods for object-centric learning build upon vision foundation models trained with self-supervision because of the rich semantic features they produce. However, these methods often involve additional training to optimize object mask accuracy at a specific granularity on a test dataset, while overlooking the quality of the object embeddings, which is arguably more important. In this work, we demonstrate how to discover objects and parts with a simple multi-scale application of k-means to the features of an off-the-shelf backbone. Our method is fast and flexible, produces interpretable masks, preserves the quality of the backbone embeddings, requires no additional training, and can capture different part-whole structures. We evaluate the quality of the resulting representation on a variety of downstream tasks, including scene classification and action recognition in videos, showing that it surpasses the performance of fine-tuned object-centric learning methods. Object masks produced by our method also effectively capture real-world objects and parts at various granularities, with quality comparable to specialized methods on unsupervised segmentation benchmarks. These results suggest rethinking the current approach to object-centric learning, with a greater focus on the quality of the representation.
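To make the abstract's recipe concrete, below is a minimal sketch of multi-scale k-means over frozen backbone patch features. It assumes a DINOv2-style backbone that returns a (1, num_patches, dim) tensor of patch features; the function and variable names are illustrative, not the authors' implementation.

```python
# Hypothetical sketch: cluster frozen backbone patch features with k-means
# at several values of k to obtain multi-scale object masks and per-object
# embeddings. No additional training is involved.
import torch
from sklearn.cluster import KMeans

def multiscale_kmeans_objects(image, backbone, ks=(4, 8, 16)):
    """Return {k: (masks, embeddings)} for each clustering granularity."""
    with torch.no_grad():
        feats = backbone(image)          # assumed shape: (1, num_patches, dim)
    feats = feats[0].cpu().numpy()       # (num_patches, dim)

    results = {}
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10).fit(feats)
        labels = km.labels_              # patch -> cluster assignment
        # One binary mask per cluster; reshape to the patch grid for display.
        masks = [(labels == c) for c in range(k)]
        # Object embedding = mean of the patch features in each cluster,
        # so the quality of the backbone embeddings is preserved.
        embeddings = [feats[m].mean(axis=0) for m in masks]
        results[k] = (masks, embeddings)
    return results
```

Running the sketch at several values of k is what yields overlapping masks across granularities: a coarse cluster (an object) at small k can be split into finer clusters (its parts) at larger k.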
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11626