Generic Fibers and Functional Dimension of Multi-Head Attention

Published: 24 May 2026, Last Modified: 28 May 2026
Venue: ICML 2026 Workshop WSS Poster
License: CC BY 4.0
Keywords: Symmetries, Weight space symmetries, Identifiability, Self-attention, Functional Dimension
TL;DR: We characterize the generic fibers of multi-head self-attention, prove there are no hidden symmetries beyond the known ones, and show that functional dimension better predicts memorization capacity than parameter count.
Abstract: Weight-space symmetries are pervasive across neural architectures, and they often generate positive-dimensional families of parameter tuples that realize the same function. Multi-head self-attention has two known families of symmetries: invertible linear changes of basis within the Q-K and O-V products, and permutations of heads. We prove that, for generic parameters, multi-head self-attention has no additional hidden symmetries, which characterizes the generic fibers of the parametrization. As a consequence, we derive a formula for the functional dimension of multi-head attention. An architecture's parameter count is often used as a measure of expressivity as models are scaled; our functional dimension formula shows when this heuristic is valid and when it is less reliable. We study a setting in which the number of attention heads is varied while the number of parameters is held fixed. Experimentally, the model's memorization capacity in bits per parameter varies with the number of heads, but most of this variation is accounted for when the number of bits memorized is normalized by the functional dimension instead of the number of parameters.
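The two known symmetries named in the abstract can be checked numerically. The following is a minimal sketch, not the paper's code: it assumes a single bias-free, unmasked attention layer with row-vector tokens and the usual 1/sqrt(d_k) scaling, and all names (`multi_head_attention`, `reparametrize`, the shapes `n`, `d`, `d_k`, `d_v`, `h`) are hypothetical choices made for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads):
    # X: (n, d) token matrix. heads: list of (W_Q, W_K, W_V, W_O) with
    # shapes (d, d_k), (d, d_k), (d, d_v), (d_v, d). Assumed convention:
    # no biases, no masking, head outputs are summed.
    out = np.zeros_like(X)
    for W_Q, W_K, W_V, W_O in heads:
        scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(W_Q.shape[1])
        out = out + softmax(scores, axis=-1) @ (X @ W_V) @ W_O
    return out

def reparametrize(head, A, B):
    # Change of basis inside the Q-K product (A) and the O-V product (B).
    # W_Q A (W_K A^{-T})^T = W_Q W_K^T and (W_V B)(B^{-1} W_O) = W_V W_O.
    W_Q, W_K, W_V, W_O = head
    return (W_Q @ A, W_K @ np.linalg.inv(A).T, W_V @ B, np.linalg.inv(B) @ W_O)

rng = np.random.default_rng(0)
n, d, d_k, d_v, h = 5, 8, 4, 4, 2
X = rng.standard_normal((n, d))
heads = [
    (rng.standard_normal((d, d_k)), rng.standard_normal((d, d_k)),
     rng.standard_normal((d, d_v)), rng.standard_normal((d_v, d)))
    for _ in range(h)
]

# Apply a well-conditioned change of basis to every head, then permute heads.
transformed = [
    reparametrize(
        head,
        np.eye(d_k) + 0.1 * rng.standard_normal((d_k, d_k)),
        np.eye(d_v) + 0.1 * rng.standard_normal((d_v, d_v)),
    )
    for head in heads
][::-1]

reference = multi_head_attention(X, heads)
same_function = np.allclose(reference, multi_head_attention(X, transformed))

# For contrast: an arbitrary additive perturbation of W_Q in the first head
# is not one of these symmetries.
W_Q, W_K, W_V, W_O = heads[0]
perturbed = [(W_Q + 0.5 * rng.standard_normal(W_Q.shape), W_K, W_V, W_O)] + heads[1:]
output_changed = not np.allclose(reference, multi_head_attention(X, perturbed))

print("symmetry preserves output:", same_function)
print("generic perturbation changes output:", output_changed)
```

The check only illustrates that these transformations leave the computed function unchanged on one input; it says nothing about the harder direction proved in the paper, namely that no further symmetries exist for generic parameters.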
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 59