On the Information Geometry of Vision Transformers

Sonia Joseph; Kumar Krishna Agrawal; Arna Ghosh; Blake Aaron Richards

Published: 29 Nov 2023, Last Modified: 29 Nov 2023, NeurReps 2023 Poster

Submission Track: Extended Abstract

Keywords: Vision Transformers, Computer Vision, Eigenspectrum Decay, Token Representations, Information Geometry

TL;DR: We study the representation geometry at the token and sequence level across transformer blocks in ViTs.

Abstract: Understanding the structure of high-dimensional representations learned by Vision Transformers (ViTs) provides a pathway toward developing a mechanistic understanding and further improving architecture design. In this work, we leverage tools from information geometry to characterize representation quality at a per-token (intra-token) level as well as across pairs of tokens (inter-token) in ViTs pretrained for object classification. In particular, we observe that these high-dimensional tokens exhibit a characteristic spectral decay in the feature covariance matrix. By measuring the rate of this decay (denoted by $\alpha$) for each token across transformer blocks, we discover an $\alpha$ signature, indicative of a transition from lower to higher effective dimensionality. We also demonstrate that tokens can be clustered based on their $\alpha$ signature, revealing that tokens corresponding to nearby spatial patches of the original image exhibit similar $\alpha$ trajectories. Furthermore, for measuring the complexity at the sequence level, we aggregate the correlation between pairs of tokens independently at each transformer block. A higher average correlation indicates a significant overlap between token representations and lower effective complexity. Notably, we observe a U-shaped trend across the model hierarchy, suggesting that token representations are more expressive in the intermediate blocks. Our findings provide a framework for understanding information processing in ViTs while providing tools to prune/merge tokens across blocks, thereby making the architectures more efficient.
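
The two measurements described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, and the abstract does not specify the exact estimators. The sketch assumes that $\alpha$ is estimated as the negative slope of a log-log linear fit to the covariance eigenspectrum, and that inter-token correlation is the mean absolute Pearson correlation between token feature vectors within each input. The function names `spectral_decay_alpha` and `mean_inter_token_correlation` and the array layouts are hypothetical.

```python
import numpy as np


def spectral_decay_alpha(features, rel_eps=1e-10):
    """Estimate the eigenspectrum decay rate alpha for one token at one block.

    features: array of shape (n_samples, dim), the representation of a single
    token position collected over many input images (assumed layout).
    Assumes lambda_k ~ k^(-alpha) and fits a line in log-log space.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (features.shape[0] - 1)
    # eigvalsh returns eigenvalues in ascending order; reverse to descending.
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    # Drop numerically zero (or slightly negative) eigenvalues before the log.
    eigvals = eigvals[eigvals > rel_eps * eigvals[0]]
    ranks = np.arange(1, eigvals.size + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(eigvals), 1)
    return float(-slope)


def mean_inter_token_correlation(tokens):
    """Average absolute correlation between token pairs at one block.

    tokens: array of shape (n_samples, n_tokens, dim) (assumed layout).
    For each input, computes the Pearson correlation between the feature
    vectors of every pair of tokens, then averages the absolute off-diagonal
    entries over pairs and inputs. This is one possible aggregation, chosen
    here for illustration.
    """
    n_tokens = tokens.shape[1]
    off_diagonal = ~np.eye(n_tokens, dtype=bool)
    per_input = [np.abs(np.corrcoef(x))[off_diagonal].mean() for x in tokens]
    return float(np.mean(per_input))


# Usage on synthetic data (stand-in for real ViT activations).
rng = np.random.default_rng(0)
dim = 32
true_alpha = 1.0
# Gaussian features whose covariance eigenvalues follow k^(-true_alpha).
scales = np.sqrt(np.arange(1, dim + 1, dtype=float) ** (-true_alpha))
synthetic_token = rng.standard_normal((20000, dim)) * scales
alpha_hat = spectral_decay_alpha(synthetic_token)

# Independent random tokens: little overlap between token representations.
random_tokens = rng.standard_normal((8, 16, 256))
corr_random = mean_inter_token_correlation(random_tokens)
```

Under these assumptions, applying `spectral_decay_alpha` to each token position at each block would produce one $\alpha$ trajectory per token, and applying `mean_inter_token_correlation` at each block would produce the sequence-level curve across depth.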

Submission Number: 69
