Training Fails to Promote Translation Equivariance in Transformers

NeurIPS 2025 Workshop NeurReps Submission 37 Authors

25 Aug 2025 (modified: 29 Oct 2025) · Submitted to NeurReps 2025 · CC BY 4.0
Keywords: transformers, translation equivariance, symmetry, foundation models
TL;DR: We observe that in a trained Gemma3 model, 27% of the variance in individual key-query attention coefficients can be attributed to the absolute position of the query in the context window, roughly the same as an untrained model.
Abstract: We present a result that appears to have notable implications for the design of transformers. In a trained Gemma3 model, we observe that 27% of the variance in individual key-query attention coefficients can be attributed to the absolute position of the query in the context window, roughly the same fraction as in an untrained model. Training therefore does not move the model toward translation equivariance.
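The abstract does not specify how the 27% figure is estimated. The sketch below shows one standard way such a fraction could be computed: a one-way variance decomposition of attention logits over the query's absolute position, i.e. Var(E[score | position]) / Var(score). The function name, array layout, and synthetic data are illustrative assumptions, not the authors' actual pipeline or Gemma3 measurements.

```python
import numpy as np

def variance_explained_by_position(scores: np.ndarray) -> float:
    """Fraction of variance in attention logits explained by query position.

    scores: array of shape (num_samples, num_query_positions), where
    scores[s, q] is one key-query attention logit observed with the query
    at absolute position q in sample s.
    """
    total_var = scores.var()                 # Var(score) over all observations
    per_position_mean = scores.mean(axis=0)  # E[score | query position q]
    between_var = per_position_mean.var()    # Var of the conditional means
    return float(between_var / total_var)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_samples, num_positions = 512, 128
    # Synthetic logits with a position-dependent mean plus noise, purely to
    # exercise the estimator; these are not real Gemma3 attention scores.
    position_effect = rng.normal(size=num_positions)
    noise = rng.normal(scale=2.0, size=(num_samples, num_positions))
    scores = position_effect[None, :] + noise
    frac = variance_explained_by_position(scores)
    print(f"fraction of variance explained by query position: {frac:.2f}")
```

A translation-equivariant attention mechanism would make this fraction near zero, since the logits would depend only on relative offsets between key and query; a large fraction that persists after training is what the abstract reports.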
Submission Number: 37