Simplifying DINO via Coding Rate Regularization

Published: 01 May 2025, Last Modified: 18 Jun 2025. ICML 2025 poster. License: CC BY 4.0
TL;DR: We propose a methodology that simplifies and improves DINO, a widely used self-supervised learning algorithm.
Abstract: DINO and DINOv2 are two model families widely used to learn representations from unlabeled image data at large scale. Their learned representations often enable state-of-the-art performance on downstream tasks such as image classification and segmentation. However, they employ many empirically motivated design choices, and their training pipelines are highly complex and unstable --- many hyperparameters must be carefully tuned to keep the representations from collapsing --- which makes it difficult to improve them or adapt them to new domains. In this work, we posit that we can remove most such empirically motivated idiosyncrasies from the pre-training pipelines, and need only add an explicit coding rate term to the loss function to prevent collapse of the representations. The result is highly simplified variants of DINO and DINOv2, which we call SimDINO and SimDINOv2, respectively. Remarkably, these simplified models are more robust to design choices such as network architecture and hyperparameters, and they learn even higher-quality representations, as measured by performance on downstream tasks, offering a Pareto improvement over the corresponding DINO and DINOv2 models. This work highlights the potential of simplifying design principles for improving the empirical practice of deep learning. Code and model checkpoints are available at https://github.com/RobinWu218/SimDINO.
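To illustrate the idea of a coding rate term, here is a minimal NumPy sketch of the standard coding rate measure from the coding-rate literature, R(Z) = ½ log det(I + d/(n·ε²)·ZZᵀ). The exact loss formulation in the paper may differ; the function name, `eps` value, and toy data below are illustrative assumptions. Maximizing such a term rewards "spread out" features and penalizes collapsed ones.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """Coding rate R(Z) = 1/2 * logdet(I + d/(n*eps^2) * Z @ Z.T).

    Z: (d, n) matrix of n feature vectors of dimension d.
    eps: distortion parameter (illustrative default).
    Larger values mean the features span more directions,
    i.e. they are less collapsed.
    """
    d, n = Z.shape
    scale = d / (n * eps ** 2)
    # slogdet is numerically stabler than log(det(...))
    sign, logdet = np.linalg.slogdet(np.eye(d) + scale * (Z @ Z.T))
    return 0.5 * logdet

rng = np.random.default_rng(0)
diverse = rng.standard_normal((8, 64))                      # spread-out features
collapsed = np.tile(rng.standard_normal((8, 1)), (1, 64))   # all vectors identical

# Diverse features attain a higher coding rate than collapsed ones,
# so maximizing this term discourages representation collapse.
print(coding_rate(diverse) > coding_rate(collapsed))
```

In a training loop, a term like this would be added to the loss (with a negative sign, since the loss is minimized) so that the optimizer cannot shrink all embeddings to a single point.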
Lay Summary: Modern AI systems learn from large amounts of data without requiring labels by training themselves to recognize patterns. For visual AI systems, a popular method is called DINO, which has been very successful but is also highly complex and difficult to train --- it requires a lot of manual tweaking and has many technical components to prevent it from learning meaningless features. In this work, we show that much of this complexity isn't necessary. We propose simplified versions of DINO, called SimDINO and SimDINOv2, which replace the complicated parts with a principled mathematical objective that encourages the model to learn diverse and informative features. This makes the training process significantly easier and more stable. Surprisingly, despite being simpler, our models actually learn better image representations than the original DINO models. This means they perform better on tasks like image classification, object detection, and segmentation --- all without the training headaches. Our work suggests that simpler AI models can be both more powerful and more practical.
Link To Code: https://github.com/RobinWu218/SimDINO
Primary Area: Deep Learning->Self-Supervised Learning
Keywords: Self-supervised learning, DINO
Submission Number: 13457