FastDINOv2: Frequency Based Curriculum Learning Improves Robustness and Training Speed

Published: 18 Sept 2025, Last Modified: 29 Oct 2025, NeurIPS 2025 poster, CC BY 4.0
Keywords: Self-Supervised Learning, Vision Foundation Model, Robustness, Efficient Training, Curriculum Learning
TL;DR: In this work, we show that vision foundation models such as DINOv2 can achieve fast convergence and maintain high robustness by applying a data curriculum and frequency-domain data augmentation during pretraining.
Abstract: Large-scale vision foundation models such as DINOv2 achieve impressive performance by leveraging massive architectures and training datasets. The expense of large-scale pre-training puts such research out of reach for many, limiting scientific progress. We therefore propose a novel pretraining strategy for DINOv2 that accelerates convergence and, as a by-product, strengthens robustness to common corruptions. Our approach combines a frequency-filtering curriculum, in which low-frequency content is seen first, with Gaussian noise patching augmentation. Applied to a ViT-B/16 backbone trained on ImageNet-1K, our method reduces pre-training time by 1.6× (from 16.64 to 10.32 NVIDIA L40S days) and FLOPs by 2.25×, while matching the DINOv2 baseline's robustness on corruption benchmarks (ImageNet-C) and maintaining competitive linear probing performance. This dual benefit of efficiency and robustness makes large-scale self-supervised foundation modeling more attainable, and it opens the door to further exploration of data curricula and augmentation as a means of improving the robustness of self-supervised learning models.
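To make the two ingredients named in the abstract concrete, the following is a minimal sketch, not the authors' implementation: a low-pass frequency filter whose cutoff could be raised over training so that low-frequency content is seen first, plus a Gaussian noise patching augmentation. The function names, the linear cutoff schedule, and the patch-size/noise-std parameters are illustrative assumptions; PyTorch is assumed.

```python
# Illustrative sketch (assumed PyTorch; not the paper's code). low_pass_filter,
# gaussian_noise_patch, and cutoff_for_epoch are hypothetical names; the schedule
# and default parameters are assumptions for illustration only.
import torch


def low_pass_filter(img: torch.Tensor, cutoff: float) -> torch.Tensor:
    """Keep spatial frequencies below `cutoff` (fraction of the maximum radius).

    img: (C, H, W) float tensor; cutoff in (0, 1], where 1.0 keeps the full spectrum.
    """
    _, h, w = img.shape
    # Per-channel 2D FFT, with the zero frequency shifted to the spectrum center.
    spectrum = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    # Circular mask centered on the spectrum.
    yy = torch.arange(h).view(-1, 1) - h / 2
    xx = torch.arange(w).view(1, -1) - w / 2
    radius = torch.sqrt(yy ** 2 + xx ** 2)
    mask = (radius <= cutoff * radius.max()).to(spectrum.dtype)
    filtered = spectrum * mask
    # Back to the spatial domain; the residual imaginary part is numerical noise.
    return torch.fft.ifft2(torch.fft.ifftshift(filtered, dim=(-2, -1))).real


def gaussian_noise_patch(img: torch.Tensor, patch_size: int = 32, std: float = 0.5) -> torch.Tensor:
    """Replace one randomly placed square patch of the image with Gaussian noise."""
    c, h, w = img.shape
    out = img.clone()
    y = torch.randint(0, h - patch_size + 1, (1,)).item()
    x = torch.randint(0, w - patch_size + 1, (1,)).item()
    out[:, y:y + patch_size, x:x + patch_size] = std * torch.randn(c, patch_size, patch_size)
    return out


def cutoff_for_epoch(epoch: int, total_epochs: int, start: float = 0.1) -> float:
    """Linearly raise the low-pass cutoff so early epochs see mostly low frequencies."""
    return min(1.0, start + (1.0 - start) * epoch / max(total_epochs - 1, 1))
```

In such a setup, `low_pass_filter(img, cutoff_for_epoch(epoch, total_epochs))` would be applied before the usual DINOv2 crops and `gaussian_noise_patch` used as an additional augmentation; the paper's exact schedule and integration points may differ.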
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 28196