Dissecting Representation Structure in Vision Transformers: A Rigorous Architectural Study

Published: 24 Apr 2026, Last Modified: 01 Jun 2026, Venue: VisCon 2026 Poster, License: CC BY 4.0
Keywords: Representation, Vision Transformer, Feature
TL;DR: We rigorously analyze feature information across diverse architectural scales, empirically uncover the relationship between ViT representation and generalization behavior, and leverage these insights to guide efficient ViT design.
Abstract: Representation structure is crucial for understanding Vision Transformer (ViT) architectures and their generalization behavior. However, prior studies neither isolate and analyze module-level features nor investigate how their interactions contribute to performance estimation. In this work, we conduct a rigorous analysis of feature information across diverse architectural scales, empirically uncover the relationship between ViT representation and generalization behavior, and leverage these insights to guide efficient ViT design. Our contributions, which hold across diverse architectural scales, are fivefold: 1) We identify feature collapse at initialization, which leads to redundancy, and propose a reduction scheme to mitigate this issue. 2) We quantify feature information using entropy and the minimum eigenvalue, demonstrating that these metrics serve as reliable indicators for generalization prediction. 3) We show that features in the token space provide a more faithful representation than those in the embedding space. 4) Unexpectedly, we find that features produced by linear submodules within ViT layers are critical for predicting generalization performance. 5) Our proposed proxy improves the correlation ranking by 18-48% over prior baselines and can effectively identify ViT architectures that achieve higher accuracy at lower or comparable computational cost.
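To make the second contribution concrete, here is a minimal sketch of how entropy and the minimum eigenvalue of a feature matrix might be computed. The abstract does not give the exact definitions, so this assumes both are taken from the eigenvalue spectrum of the feature covariance matrix; the function name `feature_spectrum_stats` and its parameters are hypothetical and not taken from the paper.

```python
import numpy as np


def feature_spectrum_stats(features, eps=1e-12):
    """Return (spectral_entropy, min_eigenvalue) for a feature matrix.

    Assumption (not from the paper): `features` has shape (n_samples, dim),
    and both metrics are read off the eigenvalues of its covariance matrix.
    """
    x = np.asarray(features, dtype=np.float64)
    # Center the features so the Gram matrix below is a covariance estimate.
    x = x - x.mean(axis=0, keepdims=True)
    cov = x.T @ x / max(x.shape[0] - 1, 1)

    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order.
    eigvals = np.linalg.eigvalsh(cov)
    # Clip tiny negative values caused by floating-point round-off.
    eigvals = np.clip(eigvals, 0.0, None)

    total = eigvals.sum()
    if total <= eps:
        # All-constant features carry no variance at all.
        return 0.0, 0.0

    # Normalize the spectrum into a probability distribution.
    p = eigvals / total
    nonzero = p[p > eps]
    entropy = float(-(nonzero * np.log(nonzero)).sum())
    return entropy, float(eigvals[0])
```

Under these assumptions, collapsed features (all samples lying along one direction) give an entropy near zero and a minimum eigenvalue near zero, while well-spread features in `dim` dimensions give an entropy approaching `log(dim)`.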
Submission Number: 18