A Phase Transition between Positional and Semantic Learning in a Solvable Model of Dot-Product Attention

Published: 16 Jun 2024 · Last Modified: 10 Jul 2024 · HiLD at ICML 2024 Poster · License: CC BY 4.0
Keywords: replica method, statistical physics, phase transition, high-dimensional limit, attention layer
TL;DR: We provide a tight asymptotic analysis of the learning of an attention layer and show evidence of a phase transition from a positional to a semantic attention mechanism as sample complexity increases.
Abstract: A theoretical understanding of how algorithmic abilities emerge in the learning of language models remains elusive. In this work, we provide a tight theoretical analysis of the emergence of semantic attention in a solvable model of dot-product attention. Concretely, we consider a non-linear self-attention layer with trainable, tied, and low-rank query and key matrices. In the asymptotic limit of high-dimensional data and a comparably large number of training samples, we provide a tight closed-form characterization of the global minimum of the non-convex empirical loss landscape. We show that this minimum corresponds to either a positional attention mechanism (with tokens attending to each other based on their respective positions) or a semantic attention mechanism (with tokens attending to each other based on their meaning), and we evidence an emergent phase transition from the former to the latter with increasing sample complexity.
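For illustration, here is a minimal NumPy sketch of the model class described in the abstract: a single self-attention head whose query and key matrices are tied and low-rank (Q = K = XW with rank r ≪ d). The softmax normalization, identity value matrix, and all variable names are assumptions made for illustration and may differ from the paper's exact parameterization.

```python
import numpy as np

def tied_low_rank_attention(X, W):
    """Dot-product attention with tied, low-rank query/key weights.

    X : (L, d) array of L token embeddings of dimension d.
    W : (d, r) trainable weight with r << d, so that Q = K = X @ W.
    Value matrix taken as the identity (assumption), so values are X itself.
    """
    d = X.shape[1]
    QK = X @ W                                      # tied queries and keys, shape (L, r)
    scores = QK @ QK.T / np.sqrt(d)                 # (L, L) attention scores
    scores -= scores.max(axis=1, keepdims=True)     # shift for numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)               # row-wise softmax
    return A @ X                                    # (L, d) attended output

# Usage sketch with random data (dimensions chosen arbitrarily):
rng = np.random.default_rng(0)
L, d, r = 4, 8, 2
X = rng.standard_normal((L, d))
W = rng.standard_normal((d, r)) / np.sqrt(d)
out = tied_low_rank_attention(X, W)                 # shape (4, 8)
```

In the paper's setting, a purely positional solution for W would make the attention pattern depend only on token positions, while a semantic solution makes it depend on the token embeddings' content; the analysis characterizes which of the two the global minimizer realizes as the sample complexity varies.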
Student Paper: Yes
Submission Number: 7