On the Existence of Hidden Subnetworks Within a Randomly Weighted Multi-Head Attention Mechanism

Published: 09 Jun 2025, Last Modified: 09 Jun 2025, HiLD at ICML 2025 Poster, CC BY 4.0
Keywords: Strong Lottery Ticket Hypothesis, Neural Network Pruning, Random Neural Network, Transformer
TL;DR: We analyze subnetworks within a randomly weighted multi-head attention mechanism.
Abstract: The strong lottery ticket hypothesis (SLTH) conjectures that high-performing subnetworks are hidden in randomly initialized neural networks. Although recent theoretical works have established the existence of such subnetworks across various neural architectures, the existence of strong lottery tickets (SLTs) in transformer architectures has only been observed empirically and lacks theoretical understanding. In particular, the current SLTH theory does not yet account for the multi-head self-attention (MHA) mechanism, a core component of transformers. To address this gap, we present a theoretical analysis of the existence of SLTs within the attention mechanism. Given $H$ heads, we prove that an arbitrary target MHA can be approximated by suitably pruning a randomly initialized MHA with key and value dimensions of $O(d\log(Hd^{3/2}))$, where $d$ is the dimension of the input and output. We further empirically validate our theoretical findings, demonstrating that an SLT within a random MHA with logarithmically wider hidden dimensions can approximate the performance of its trained counterpart.
Student Paper: Yes
Submission Number: 29
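
The abstract's central object is a randomly initialized MHA whose key/value projections are pruned by binary masks so that the surviving subnetwork approximates a target MHA. The sketch below is not the authors' construction; it is a minimal illustration, assuming a PyTorch setting, of what such a pruned-random MHA looks like. The dimension names d, d_kv, and H mirror the abstract, and the mask here is drawn at random purely to show the structure, whereas the paper proves that a suitable mask exists when d_kv is on the order of d log(H d^{3/2}).

    # Illustrative sketch only: a randomly weighted MHA whose projections are
    # masked (pruned) to form a subnetwork. The random mask is a stand-in for
    # the existence result proved in the paper.
    import torch
    import torch.nn.functional as F

    d, H = 64, 4      # input/output dimension and number of heads
    d_kv = 128        # wider key/value dimension (theory: O(d log(H d^{3/2})))

    torch.manual_seed(0)

    def random_mha_params(d, d_kv, H):
        """Randomly initialized per-head projection matrices (never trained)."""
        return {
            "Wq": torch.randn(H, d, d_kv),
            "Wk": torch.randn(H, d, d_kv),
            "Wv": torch.randn(H, d, d_kv),
            "Wo": torch.randn(H, d_kv, d),
        }

    def prune(params, keep_prob=0.5):
        """Zero out weights with a binary mask; the kept entries form the subnetwork."""
        return {k: w * (torch.rand_like(w) < keep_prob) for k, w in params.items()}

    def mha(x, params):
        """Multi-head self-attention using the given (possibly pruned) projections."""
        # x: (seq_len, d)
        heads = []
        for h in range(params["Wq"].shape[0]):
            q = x @ params["Wq"][h]                    # (seq_len, d_kv)
            k = x @ params["Wk"][h]
            v = x @ params["Wv"][h]
            attn = F.softmax(q @ k.T / d_kv**0.5, dim=-1)
            heads.append(attn @ v @ params["Wo"][h])   # (seq_len, d)
        return sum(heads)

    x = torch.randn(10, d)
    subnet_out = mha(x, prune(random_mha_params(d, d_kv, H)))
    print(subnet_out.shape)  # torch.Size([10, 64])

In the paper's setting, the question is whether some choice of these binary masks makes the pruned random MHA match an arbitrary target MHA of width d; the theorem answers yes, provided the random network's key/value dimension is logarithmically wider than d.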