From Attention to Activation: Unraveling the Enigmas of Large Language Models

Published: 22 Jan 2025, Last Modified: 09 Apr 2025, ICLR 2025 Poster, CC BY 4.0
Keywords: Transformers, Adam, Optimizer, Outliers, Attention, Quantization
TL;DR: Popular transformer models exhibit strange behaviours, including large outlier activations and attention dominance by specific tokens; we fix these with a change to softmax and a suitable change of optimiser.
Abstract: We study two strange phenomena in auto-regressive Transformers: (1) the dominance of the first token in attention heads; (2) the occurrence of large outlier activations in the hidden states. We find that popular large language models, such as Llama, attend maximally to the first token in 98% of attention heads, a behaviour we attribute to the softmax function. To mitigate this issue, we propose a reformulation of softmax to softmax-1. Furthermore, we identify adaptive optimisers, e.g. Adam, as the primary contributor to the large outlier activations and introduce OrthoAdam, a novel optimiser that uses orthogonal matrices to transform gradients, to address this issue. Finally, our methods not only prevent these phenomena from occurring, but also enable Transformers to sustain their performance when quantised using basic algorithms, something that standard methods are unable to do. In summary, our methods reduce the attention proportion on the first token from 65% to 3.3%, the activation kurtosis in the hidden states from 1657 to 3.1, and the perplexity penalty under 4-bit weight quantisation from 3565 to 0.3. Code is available at https://github.com/prannaykaul/OrthoAdam
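The sketch below illustrates the two ideas named in the abstract, under stated assumptions rather than as the authors' released implementation: it assumes softmax-1 denotes the softmax variant with a unit added to the denominator, exp(x_i) / (1 + sum_j exp(x_j)), and it stands in a signed random permutation for OrthoAdam's orthogonal gradient transform, since a signed permutation is orthogonal and cheap to apply. The names `softmax_1` and `SignedPermutationAdam` are illustrative, not taken from the paper's code.

```python
import torch


def softmax_1(logits: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Softmax with an implicit extra zero logit:
    softmax_1(x)_i = exp(x_i) / (1 + sum_j exp(x_j)).
    Outputs may sum to less than 1, so an attention head can attend to
    (almost) nothing instead of parking its mass on the first token.
    """
    # Shift by the (non-negative) max for numerical stability; the implicit
    # zero logit is shifted by the same amount, becoming exp(-m).
    m = logits.max(dim=dim, keepdim=True).values.clamp_min(0.0)
    exp = torch.exp(logits - m)
    return exp / (torch.exp(-m) + exp.sum(dim=dim, keepdim=True))


class SignedPermutationAdam(torch.optim.Optimizer):
    """Illustrative optimiser: Adam run in a fixed, randomly rotated basis.

    Per parameter, a fixed signed random permutation Q (an orthogonal map)
    rotates the gradient; Adam's moment estimates live in that basis; the
    resulting step is rotated back with Q^T before updating the weights.
    """

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        super().__init__(params, dict(lr=lr, betas=betas, eps=eps))

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            lr, (b1, b2), eps = group["lr"], group["betas"], group["eps"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad.reshape(-1)
                state = self.state[p]
                if not state:
                    n = g.numel()
                    state["perm"] = torch.randperm(n, device=g.device)
                    state["sign"] = (torch.randint(0, 2, (n,), device=g.device) * 2 - 1).to(g.dtype)
                    state["m"] = torch.zeros_like(g)
                    state["v"] = torch.zeros_like(g)
                    state["t"] = 0
                perm, sign = state["perm"], state["sign"]
                # Rotate the gradient into the random basis: (Qg)[i] = sign[i] * g[perm[i]].
                gq = g[perm] * sign
                state["t"] += 1
                t = state["t"]
                # Standard Adam moment updates, tracked in the rotated basis.
                state["m"].mul_(b1).add_(gq, alpha=1 - b1)
                state["v"].mul_(b2).addcmul_(gq, gq, value=1 - b2)
                m_hat = state["m"] / (1 - b1 ** t)
                v_hat = state["v"] / (1 - b2 ** t)
                step_q = m_hat / (v_hat.sqrt() + eps)
                # Rotate the update back (Q^T) and apply it to the weights.
                step = torch.empty_like(step_q)
                step[perm] = step_q * sign
                p.add_(step.reshape(p.shape), alpha=-lr)
```

With any orthogonal Q, the update applied to the weights is Q^T of a coordinate-wise Adam step computed on Qg; the intuition the abstract points to is that decoupling Adam's per-coordinate scaling from the parameter basis removes the mechanism that concentrates huge activations in a few hidden-state channels. The specific orthogonal transform used by OrthoAdam should be taken from the linked repository.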
Supplementary Material: pdf
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7191