Keywords: Representation Learning, Self-Supervised Learning, Coding Rate
Abstract: DINO and DINOv2 are two model families widely used to learn representations from unlabeled imagery data at large scale. Their learned representations often give state-of-the-art performance on downstream tasks such as image classification and segmentation. However, their training pipelines are highly complex and unstable, which makes them difficult to improve and to adapt to new domains. In particular, they rely on many empirically motivated design choices and carefully tuned hyperparameters to ensure that the representations do not collapse. In this work, we posit that many of these empirically motivated idiosyncrasies can be removed from the pre-training pipelines, and that adding an explicit coding rate term to the loss function suffices to avoid collapse of the representations. As a result, we obtain highly simplified variants of the DINO and DINOv2 model families, which we call SimDINO and SimDINOv2, respectively. Notably, their training pipelines are more robust to different design choices, such as network architecture and hyperparameters, and they learn even higher-quality representations, as measured by performance on downstream tasks, offering a Pareto improvement over the corresponding DINO and DINOv2 model families. This work highlights the potential of simplifying design principles to improve empirical outcomes in deep learning.
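The abstract's key technical ingredient is an "explicit coding rate term" added to the loss to prevent representation collapse. As a minimal, hedged sketch (not the paper's implementation), the standard coding rate of Ma et al. for a batch of n feature vectors of dimension d is R(Z) = (1/2) logdet(I_d + (d / (n * eps^2)) Z^T Z); maximizing it (i.e., subtracting it from the loss) encourages the features to spread out. The function name, shapes, and eps value below are illustrative assumptions, not taken from the paper.

```python
import torch


def coding_rate(Z: torch.Tensor, eps: float = 0.5) -> torch.Tensor:
    """Coding rate R(Z) = 1/2 * logdet(I_d + d/(n*eps^2) * Z^T Z).

    Z: (n, d) batch of feature vectors (typically L2-normalized).
    eps: precision parameter; value here is an illustrative default.
    """
    n, d = Z.shape
    identity = torch.eye(d, device=Z.device, dtype=Z.dtype)
    scaled_cov = (d / (n * eps ** 2)) * (Z.T @ Z)
    return 0.5 * torch.logdet(identity + scaled_cov)


# Hypothetical usage: add the (negated) coding rate to a similarity-based
# self-supervised loss so the representations cannot collapse to a point.
# total_loss = alignment_loss - gamma * coding_rate(features)
```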
Submission Number: 64