From Local to Contextually-Enriched Local Representations: A Mechanism for Holistic Processing in DINOv2 ViTs
Keywords: Vision transformers, Understanding high-level properties of models, Probing
TL;DR: Mechanistic analysis shows DINOv2’s mid-layers keep content locally anchored, strengthen positional alignment, and integrate global context via a mixture of short- and long-range attention heads, thereby supporting holistic processing.
Abstract: Self-supervised Vision Transformers (ViTs) such as DINOv2 achieve robust holistic shape processing, but the transformations that support this ability remain unclear. Probing with visual anagrams, we find that DINOv2’s intermediate layers constitute a necessary stage for holistic vision. Our analyses reveal a structured sequence of computations. First, local content representations remain spatially anchored deeper into the network than in supervised ViTs. Second, attention heads progressively extend their range, producing a systematic local-to-global transition that yields contextually enriched local representations. Third, positional signals are not merely lost with depth but become more sharply aligned with the model’s learned positional embeddings in mid-level layers. Models without these properties, such as supervised ViTs, rapidly lose spatially specific content and fail on holistic tasks. Finally, when register tokens are present, high-norm global activations are redirected into these tokens rather than overwriting low-information patch embeddings, allowing patches to maintain their positional identity and further improving performance on holistic tasks. Together, these findings show that holistic vision in ViTs emerges from a structured progression of representational transformations that preserve both content and spatial information while enabling global integration.
Submission Number: 172
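To make the abstract's local-to-global claim concrete, below is a minimal sketch, not the authors' released code, of the standard mean attention distance statistic computed per head from post-softmax attention maps. Everything here is an illustrative assumption: the function name `mean_attention_distance`, the grid and patch sizes (16x16 patches of 14 px, matching a 224 px input to a ViT-S/14 backbone), and the premise that CLS/register tokens have already been stripped. In practice, the attention tensors would be captured with forward hooks on a DINOv2 block.

```python
import torch

def mean_attention_distance(attn: torch.Tensor, grid: int, patch: int = 14) -> torch.Tensor:
    """Attention-weighted mean query-key distance (in pixels) for each head.

    attn: (heads, N, N) post-softmax attention over N = grid * grid patch
    tokens, with CLS/register tokens assumed already removed.
    """
    # Pixel coordinates of each patch center on the grid: (N, 2)
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float() * patch
    # Pairwise Euclidean distances between patch centers: (N, N)
    dist = torch.cdist(coords, coords)
    # Expected distance under each head's attention distribution, averaged over queries
    return (attn * dist).sum(-1).mean(-1)  # (heads,)

# Toy usage on random attention weights for a 16x16 patch grid (ViT-S has 6 heads)
heads, grid = 6, 16
attn = torch.randn(heads, grid * grid, grid * grid).softmax(dim=-1)
print(mean_attention_distance(attn, grid))  # one mean distance (px) per head
```

Under the account given in the abstract, this per-head statistic should grow with depth in DINOv2 while a mixture of short- and long-range heads persists in mid-layers, which is what lets local representations stay spatially anchored while absorbing global context.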