Vision Transformers Secretly Crave Noise

ICLR 2026 Conference Submission16330 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Self-supervised Learning, Vision Transformer, Fine-tuning, Transfer Learning
Abstract: Data augmentation and regularization have proven to be fundamental techniques for enhancing the generalization of deep neural networks. While canonical methods such as RandAug, CutMix, Mixup, RandErase, and DropPath offer diverse regularization effects, their combined use appears to have reached a saturation point, leaving little room for further performance gains. In this work, we introduce DiffNoise, a novel data augmentation strategy that injects smooth noise-based perturbations into the input embedding space rather than directly into the raw input. Contrary to conventional belief, DiffNoise acts orthogonally to existing data augmentations, improving the standard recipe that has largely reached saturation. This improvement may be interpreted as expanding the augmentation space along a previously unexplored axis, without any architectural modifications or auxiliary objectives. Furthermore, DiffNoise implicitly benefits from improved localization capability and learns generalized, robust representations across various models. Extensive experiments across a wide spectrum of model families—including ViTs, CLIP, and self-supervised architectures—show that DiffNoise consistently enhances performance across multiple downstream tasks. Code is available in the Supplementary Material.
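The abstract describes perturbing patch embeddings with smooth noise rather than perturbing raw pixels. As a rough illustration only (the paper's actual formulation is in the Supplementary Material), the following is a minimal sketch of that idea: sample Gaussian noise per token, low-pass filter it along the token axis with a moving average so the perturbation varies smoothly across neighbouring patches, and add it to the embeddings. The function name `diffnoise_perturb` and the hyperparameters `alpha` (noise magnitude) and `kernel` (smoothing window) are assumptions, not the authors' API.

```python
import numpy as np

def diffnoise_perturb(embeddings, alpha=0.1, kernel=5, rng=None):
    """Hypothetical sketch of embedding-space smooth-noise augmentation.

    embeddings: (num_tokens, dim) array of ViT patch embeddings.
    alpha:      assumed noise-magnitude hyperparameter.
    kernel:     moving-average window that smooths raw Gaussian noise
                along the token axis, giving a low-frequency perturbation.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(embeddings.shape)
    # Smooth along the token axis so the perturbation changes slowly
    # across neighbouring patches instead of being i.i.d. per token.
    pad = kernel // 2
    padded = np.pad(noise, ((pad, pad), (0, 0)), mode="edge")
    smooth = np.stack([padded[i:i + kernel].mean(axis=0)
                       for i in range(embeddings.shape[0])])
    return embeddings + alpha * smooth
```

In a ViT this would sit between the patch-embedding layer and the transformer blocks during fine-tuning, leaving the raw-input augmentation pipeline (RandAug, Mixup, etc.) untouched, which is consistent with the claim that the method is orthogonal to existing augmentations.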
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 16330