Block Recurrent Dynamics in Vision Transformers

ICLR 2026 Conference Submission 21342 Authors

19 Sept 2025 (modified: 08 Oct 2025), License: CC BY 4.0
Keywords: Computer Vision, Interpretability, Dynamical Systems
TL;DR: We hypothesize that vision transformers are block-recurrent and validate this by training a recurrent surrogate of DINOv2 that recovers 94% of its accuracy with only 2 distinct blocks. We then study DINOv2 from a dynamical-systems perspective.
Abstract: As Vision Transformers (ViTs) become standard backbones across vision, a mechanistic account of their computational phenomenology is now essential. Despite architectural cues that hint at dynamical structure, there is no settled framework that interprets Transformer depth as a well-characterized flow. In this work, we introduce the $\textbf{Block-Recurrent Hypothesis (BRH)}$, arguing that trained ViTs admit a block-recurrent depth structure such that the computation of the original $L$ blocks can be accurately rewritten using only $k \ll L$ distinct blocks applied recurrently. Across diverse ViTs, between-layer representational similarity matrices suggest a few contiguous phases. To determine whether this reflects reusable computation, we operationalize our hypothesis in the form of block-recurrent surrogates of pretrained ViTs, which we call Recurrent Approximations to Phase-structured TransfORmers ($\texttt{Raptor}$). Using small-scale ViTs, we demonstrate that phase-structure metrics correlate with our ability to fit $\texttt{Raptor}$ accurately, and we identify the role of stochastic depth in promoting the recurrent block structure. We then provide an empirical existence proof for the BRH in foundation models by showing that we can train a $\texttt{Raptor}$ model to recover $94$\% of DINOv2 ImageNet-1k linear probe accuracy with only 2 blocks. To provide a mechanistic account of these observations, we leverage our hypothesis to develop a program of $\textbf{Dynamical Interpretability}$. We find $\textit{\textbf{(i)}}$ directional convergence into class-dependent angular basins with self-correcting trajectories under small perturbations; $\textit{\textbf{(ii)}}$ token-specific dynamics, where $\texttt{cls}$ executes sharp late reorientations while $\texttt{patch}$ tokens exhibit strong late-stage coherence reminiscent of a mean-field effect and converge rapidly toward their mean direction; and $\textit{\textbf{(iii)}}$ a collapse of the update field to low rank at late depth, consistent with convergence to low-dimensional attractors. Altogether, we find that a compact recurrent program emerges along the depth of ViTs, pointing to a low-complexity normative solution that enables these models to be studied through principled dynamical systems analysis.
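To make the Block-Recurrent Hypothesis concrete, the following is a minimal sketch, not the authors' implementation, of a block-recurrent surrogate in the spirit of $\texttt{Raptor}$: $k$ distinct blocks, each applied recurrently for several steps, emulate the depth of an $L$-block teacher. The class name `RecurrentSurrogate` and the parameters `num_phases` and `steps_per_phase` are illustrative assumptions, as is the distillation-style objective in the comment; the paper's actual training recipe may differ.

```python
# Minimal sketch of a block-recurrent surrogate (assumed, not the paper's code).
import torch
import torch.nn as nn

class RecurrentSurrogate(nn.Module):
    def __init__(self, dim=768, num_heads=12, num_phases=2, steps_per_phase=6):
        super().__init__()
        # One distinct transformer block per "phase" (k = num_phases);
        # each block is reused recurrently within its phase.
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                       batch_first=True)
            for _ in range(num_phases)
        ])
        self.steps_per_phase = steps_per_phase

    def forward(self, x):
        # Effective depth L is emulated as num_phases * steps_per_phase
        # applications of only num_phases distinct parameter sets.
        for block in self.blocks:
            for _ in range(self.steps_per_phase):
                x = block(x)
        return x

# Hypothetical usage: distill a frozen teacher's final hidden states.
# student = RecurrentSurrogate()
# loss = torch.nn.functional.mse_loss(student(tokens), teacher_hidden(tokens))
```

Under this reading, the BRH claims that for a trained ViT such a surrogate with $k \ll L$ (here $k=2$) can closely match the teacher's representations, which is what the reported 94% linear-probe recovery on DINOv2 operationalizes.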
Primary Area: interpretability and explainable AI
Submission Number: 21342