A Two-Character Change in Transformer Architecture Promotes Ideal Token Geometry

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Transformers, Neural Collapse, Attention
TL;DR: We conjecture that an ideal transformer classifier collapses token embeddings by class, find that current transformers fall well short of this, and modify attention to induce stronger collapse and better performance.
Abstract: We hypothesize that in the optimal geometric configuration of token embeddings for transformer classifiers, tokens of the same class should collapse to a single point, and these class points should themselves exhibit Neural Collapse. We study whether current transformers achieve this configuration through principal component projections, cosine similarity measurements, analysis of variance on token embeddings, and Neural Collapse measurements, and find that they fall far short of the conjectured ideal. To address this, we introduce a simple modification to attention that brings token embeddings markedly closer to the conjectured configuration and yields consistent performance improvements across benchmarks.
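The abstract mentions quantifying collapse via analysis of variance and Neural Collapse measurements on token embeddings. Below is a minimal sketch of one such measurement, the standard NC1 within-/between-class variability statistic from the Neural Collapse literature (Papyan et al., 2020); the function name `nc1_variability` and the PyTorch framing are illustrative assumptions, not the paper's code, and the paper's full protocol (PCA projections, cosine similarities) is not reproduced here.

```python
import torch

def nc1_variability(embeddings: torch.Tensor, labels: torch.Tensor) -> float:
    """NC1-style within-/between-class variability on token embeddings.

    embeddings: (N, d) tensor, one row per token embedding
    labels:     (N,)   class index assigned to each token
    Lower values mean tokens sit closer to their class means, i.e. stronger collapse.
    """
    classes = labels.unique()
    n, d = embeddings.shape
    global_mean = embeddings.mean(dim=0)

    sigma_w = embeddings.new_zeros(d, d)  # within-class scatter
    sigma_b = embeddings.new_zeros(d, d)  # between-class scatter
    for c in classes:
        class_emb = embeddings[labels == c]
        mu_c = class_emb.mean(dim=0)
        centered = class_emb - mu_c
        sigma_w += centered.T @ centered / n
        diff = (mu_c - global_mean).unsqueeze(1)
        sigma_b += diff @ diff.T / len(classes)

    # NC1 = trace(Sigma_W @ pinv(Sigma_B)) / C; approaches 0 under full collapse
    return (torch.trace(sigma_w @ torch.linalg.pinv(sigma_b)) / len(classes)).item()
```

Under the hypothesized ideal configuration, this statistic would approach zero when evaluated on last-layer token embeddings grouped by class.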
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 24030