Class Separation Dynamics in Vision Transformers: An Empirical Study

Published in IEEE Access, 2025. Last modified: 12 Nov 2025. License: CC BY-SA 4.0.
Abstract: Vision Transformers (ViTs) have emerged in recent years as a powerful architecture for image classification and beyond. However, due to their black-box nature, our understanding of their learning dynamics remains limited. In this work, we study ViT learning dynamics through the lens of class separation, tracking how inter- and intra-class structure evolves from the patch-embedding layer to the classifier head. Across four datasets and extensive hyper-parameter configurations, we characterize two primary regularities. First, we observe a layer-wise consistency: separation fuzziness spikes between the input projection and the first transformer block, then decays exponentially with depth. Second, provided that the first regularity holds, fuzziness at the final layer also decays exponentially as a function of the training epochs. The persistence of these regularities across datasets and hyper-parameters suggests that they are intrinsic to ViT training. Our study supplies a quantitative foundation for interpreting and designing transformer-based vision models. Code and reproducibility scripts are available through https://github.com/DaraVaram/vit-data-separation
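To make the abstract's notion of "separation fuzziness" concrete, the sketch below computes one common way to quantify class separation in a layer's feature space: the ratio of mean intra-class scatter to inter-class scatter, evaluated on per-layer features. This is an illustrative metric under our own assumptions, not necessarily the exact definition used in the paper; the function name `separation_fuzziness` and its normalization are hypothetical.

```python
import numpy as np

def separation_fuzziness(features, labels):
    """Ratio of average intra-class scatter to inter-class scatter.

    Lower values indicate better-separated classes. Hypothetical
    illustrative metric; the paper may define fuzziness differently.

    features: (n_samples, dim) array of layer activations.
    labels:   (n_samples,) array of integer class labels.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    global_mean = features.mean(axis=0)

    intra, inter = 0.0, 0.0
    for c in classes:
        cls = features[labels == c]
        mu = cls.mean(axis=0)
        # Mean squared distance of samples to their class centroid.
        intra += np.mean(np.sum((cls - mu) ** 2, axis=1))
        # Squared distance of the class centroid to the global mean.
        inter += np.sum((mu - global_mean) ** 2)

    k = len(classes)
    return (intra / k) / (inter / k + 1e-12)
```

Applied to the features extracted after each transformer block, this metric would let one plot fuzziness versus depth (or versus training epoch) and check for the exponential decay the paper reports.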