Keywords: transformers, translation equivariance, symmetry, foundation models
TL;DR: We observe that in a trained Gemma3 model, 27% of the variance in individual key-query attention coefficients can be attributed to the absolute position of the query in the context window, roughly the same as in an untrained model.
Abstract: We present a result with seemingly remarkable implications for the design of transformers. We observe that in a trained Gemma3 model, 27% of the variance in individual key-query attention coefficients can be attributed to the absolute position of the query in the context window, roughly the same as in an untrained model. Training thus produces no shift toward translation equivariance.
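Below is a minimal sketch (not the authors' code) of one way such a statistic could be estimated: a one-way variance decomposition of attention coefficients grouped by absolute query position. The array shape, the synthetic placeholder data, and the choice to pool over contexts, heads, and key positions are all assumptions for illustration.

```python
import numpy as np

def variance_explained_by_query_position(attn: np.ndarray) -> float:
    """attn: attention coefficients with shape (contexts, heads, query_pos, key_pos).

    Returns the fraction of total variance attributable to absolute query position,
    i.e. Var(E[a | query_pos]) / Var(a) under equal-sized groups.
    """
    n_ctx, n_heads, n_q, n_k = attn.shape
    # Group coefficients by absolute query position: one group per query index.
    groups = attn.transpose(2, 0, 1, 3).reshape(n_q, -1)  # (query_pos, samples)
    grand_mean = groups.mean()
    total_var = groups.var()
    # Between-group variance: spread of per-position means around the grand mean.
    between_var = ((groups.mean(axis=1) - grand_mean) ** 2).mean()
    return float(between_var / total_var)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder standing in for attention coefficients extracted from a model.
    fake_attn = rng.normal(size=(8, 4, 128, 128))
    print(f"fraction of variance from query position: "
          f"{variance_explained_by_query_position(fake_attn):.3f}")
```

On random placeholder data the statistic is near zero; comparing its value for a trained versus an untrained model is the kind of contrast the abstract describes.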
Submission Number: 37