From Local to Contextually-Enriched Local Representations: A Mechanism for Holistic Processing in DINOv2 ViTs
Keywords: Vision transformers, Understanding high-level properties of models, Probing
TL;DR: Mechanistic analysis shows DINOv2’s mid-layers keep content locally anchored, strengthen positional alignment, and integrate global context via a mixture of short- and long-range attention heads, thereby supporting holistic processing.
Abstract: Self-supervised Vision Transformers (ViTs) such as DINOv2 achieve robust holistic shape processing, but the transformations that support this ability remain unclear. Probing with visual anagrams, we find that DINOv2’s intermediate layers constitute a necessary stage for holistic vision. Our analyses reveal a structured sequence of computations. First, local content representations remain spatially anchored deeper into the network than in supervised ViTs. Second, attention heads progressively extend their range, producing a systematic local-to-global transition that yields contextually enriched local representations. Third, positional signals are not merely lost with depth but become more sharply aligned with the model’s learned positional embeddings in mid-level layers. Models without these properties, such as supervised ViTs, rapidly lose spatially specific content and fail on holistic tasks. Finally, when register tokens are present, high-norm global activations are redirected into these tokens rather than overwriting low-information patch embeddings, allowing patches to maintain their positional identity and further improving performance on holistic tasks. Together, these findings show that holistic vision in ViTs emerges from a structured progression of representational transformations that preserve both content and spatial information while enabling global integration.
Submission Number: 172
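To make the abstract's local-to-global claim concrete, below is a minimal sketch, not the authors' released code, of the standard mean attention distance statistic computed per head from post-softmax attention maps. Everything here is an illustrative assumption: the function name `mean_attention_distance`, the grid and patch sizes (16x16 patches of 14 px, matching a 224 px input to a ViT-S/14 backbone), and the premise that CLS/register tokens have already been stripped. In practice, the attention tensors would be captured with forward hooks on a DINOv2 block.

```python
import torch

def mean_attention_distance(attn: torch.Tensor, grid: int, patch: int = 14) -> torch.Tensor:
    """Attention-weighted mean query-key distance (in pixels) for each head.

    attn: (heads, N, N) post-softmax attention over N = grid * grid patch
    tokens, with CLS/register tokens assumed already removed.
    """
    # Pixel coordinates of each patch center on the grid: (N, 2)
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float() * patch
    # Pairwise Euclidean distances between patch centers: (N, N)
    dist = torch.cdist(coords, coords)
    # Expected distance under each head's attention distribution, averaged over queries
    return (attn * dist).sum(-1).mean(-1)  # (heads,)

# Toy usage on random attention weights for a 16x16 patch grid (ViT-S has 6 heads)
heads, grid = 6, 16
attn = torch.randn(heads, grid * grid, grid * grid).softmax(dim=-1)
print(mean_attention_distance(attn, grid))  # one mean distance (px) per head
```

Under the account given in the abstract, this per-head statistic should grow with depth in DINOv2 while a mixture of short- and long-range heads persists in mid-layers, which is what lets local representations stay spatially anchored while absorbing global context.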