On the Information Geometry of Vision Transformers

Published: 29 Nov 2023, Last Modified: 29 Nov 2023, NeurReps 2023 Poster
Submission Track: Extended Abstract
Keywords: Vision Transformers, Computer Vision, Eigenspectrum Decay, Token Representations, Information Geometry
TL;DR: We study the representation geometry at the token and sequence level across transformer blocks in ViTs.
Abstract: Understanding the structure of high-dimensional representations learned by Vision Transformers (ViTs) offers a pathway toward a mechanistic understanding of these models and toward better architecture design. In this work, we leverage tools from information geometry to characterize representation quality at a per-token (intra-token) level as well as across pairs of tokens (inter-token) in ViTs pretrained for object classification. In particular, we observe that these high-dimensional tokens exhibit a characteristic spectral decay in the feature covariance matrix. By measuring the rate of this decay (denoted by $\alpha$) for each token across transformer blocks, we discover an $\alpha$ signature, indicative of a transition from lower to higher effective dimensionality. We also demonstrate that tokens can be clustered based on their $\alpha$ signature, revealing that tokens corresponding to nearby spatial patches of the original image exhibit similar $\alpha$ trajectories. Furthermore, to measure complexity at the sequence level, we aggregate the correlation between pairs of tokens independently at each transformer block. A higher average correlation indicates a significant overlap between token representations and hence lower effective complexity. Notably, we observe a U-shaped trend in this average correlation across the model hierarchy, suggesting that token representations are more expressive in the intermediate blocks. Our findings provide a framework for understanding information processing in ViTs, along with tools to prune or merge tokens across blocks, thereby making the architectures more efficient.
Submission Number: 69
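
The two measurements described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the spectral decay follows a power law, $\lambda_i \propto i^{-\alpha}$, fitted by least squares in log-log space, and that the sequence-level measure is the mean absolute Pearson correlation over all distinct token pairs; the abstract does not specify either choice. The function names, the `fit_range` parameter, and the input shapes are hypothetical.

```python
import numpy as np


def spectral_decay_alpha(features, fit_range=None):
    """Estimate alpha under the assumed power law lambda_i ~ i^(-alpha).

    features: array of shape (n_samples, dim), e.g. the activations of one
    token position at one transformer block, collected over many images.
    fit_range: optional (lo, hi) slice indices into the descending
    eigenvalues (hypothetical parameter; defaults to the full spectrum).
    """
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (features.shape[0] - 1)
    # eigvalsh returns ascending eigenvalues; reverse to get descending order.
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    ranks = np.arange(1, eigvals.size + 1)
    lo, hi = fit_range if fit_range is not None else (0, eigvals.size)
    # Drop non-positive eigenvalues so the logarithm is defined.
    keep = eigvals[lo:hi] > 0
    slope, _ = np.polyfit(
        np.log(ranks[lo:hi][keep]), np.log(eigvals[lo:hi][keep]), 1
    )
    return float(-slope)


def mean_inter_token_correlation(tokens):
    """Mean absolute Pearson correlation over all distinct token pairs.

    tokens: array of shape (n_tokens, dim), the token representations of
    one image at one transformer block (assumed aggregation rule).
    """
    corr = np.corrcoef(tokens)  # rows are treated as variables
    off_diagonal = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.abs(off_diagonal).mean())
```

Under these assumptions, calling `spectral_decay_alpha` once per token position and per block would give one $\alpha$ trajectory per token, and calling `mean_inter_token_correlation` once per block would give the sequence-level curve whose shape the abstract describes.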