Hidden-Layer Self-Distillation Yields Drift-Resilient Visual Representations

Published: 28 Feb 2026, Last Modified: 04 Apr 2026, CAO Poster, CC BY 4.0
Keywords: self-supervised learning, SSL, MAE, JEPA, robustness, TTA, SAR
TL;DR: SSL yields more robust representations when the pretraining task is to predict representations at multiple levels of abstraction.
Abstract: The choice of pretraining strategy directly affects how well visual representations withstand distribution shift at deployment. We study Bootleg, a recent self-supervised method that predicts continuous latent representations from multiple hidden layers of an EMA teacher network, spanning early stimulus-driven features to late semantic features. This multi-scale objective forces representations to encode both fine-grained spatial detail and high-level semantics. We evaluate Bootleg against MAE, CrossMAE, data2vec 2.0, and I-JEPA at three ViT scales (S, B, L) on distribution-shift benchmarks. Bootleg pretraining yields the best or second-best robustness at every model size. We further show that Bootleg representations respond well to test-time adaptation with SAR, which produces the largest accuracy gains under corruption shift. These results suggest that grounding SSL targets across the network hierarchy is a promising strategy for drift-resilient representation learning.
Submission Number: 127
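
The objective described in the abstract, predicting teacher latents taken from several hidden layers, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the toy MLP stack, the choice of target layers, the zero-masking of input tokens, the per-layer linear prediction heads, and all function names (`ema_update`, `hidden_states`, `multi_layer_latent_loss`) are assumptions made purely for illustration.

```python
import numpy as np

def ema_update(teacher, student, momentum=0.99):
    """Move each teacher weight toward the student weight (exponential moving average)."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

def hidden_states(weights, x):
    """Run a toy MLP stack and return the activation after every layer."""
    states, h = [], x
    for w in weights:
        h = np.tanh(h @ w)
        states.append(h)
    return states

def multi_layer_latent_loss(predictions, targets, mask):
    """Mean squared error between predicted and teacher latents, averaged over
    the target layers and restricted to the masked token positions."""
    per_layer = [np.mean((p[mask] - t[mask]) ** 2) for p, t in zip(predictions, targets)]
    return float(np.mean(per_layer))

rng = np.random.default_rng(0)
num_tokens, dim, depth = 16, 8, 4

# Hypothetical setup: the student and the EMA teacher start from the same weights.
student = [rng.normal(scale=0.5, size=(dim, dim)) for _ in range(depth)]
teacher = [w.copy() for w in student]

tokens = rng.normal(size=(num_tokens, dim))
mask = np.arange(num_tokens) % 4 != 0  # assumption: 75% of tokens hidden from the student

# Teacher sees the full input; targets come from early, middle, and late layers.
target_layers = [0, 1, 3]  # assumed selection, not taken from the paper
teacher_states = hidden_states(teacher, tokens)
targets = [teacher_states[i] for i in target_layers]

# Student sees only visible tokens (masked ones zeroed); one assumed linear
# head per target layer maps its final output to that layer's latent space.
student_out = hidden_states(student, tokens * (~mask)[:, None])[-1]
heads = [rng.normal(scale=0.1, size=(dim, dim)) for _ in target_layers]
predictions = [student_out @ head for head in heads]

loss = multi_layer_latent_loss(predictions, targets, mask)

# After a (hypothetical) gradient step on the student, the teacher would be refreshed:
teacher = ema_update(teacher, student)
```

In a real training loop the student and heads would be optimized to reduce `loss`, with no gradient flowing through the teacher; the sketch only shows how targets from several depths enter a single objective.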