TL;DR: A thorough theoretical and empirical analysis of high-norm tokens in LLMs, explaining how they emerge and decay, with applications to quantization and LLM signatures.
Abstract: Large transformer models are known to produce high-norm tokens. In vision transformers (ViTs), such tokens have been mathematically modeled through the singular vectors of the linear approximations of layers. However, in large language models (LLMs), the underlying causes of high-norm tokens remain largely unexplored, and their properties differ from those in ViTs, calling for a new analysis framework. In this paper, we provide both theoretical insights and empirical validation across a range of recent models, leading to the following observations: i) The layer-wise singular direction predicts the abrupt explosion of token norms in LLMs. ii) The negative eigenvalues of a layer explain the sudden decay of token norms. iii) The computational pathways leading to high-norm tokens differ between initial and noninitial tokens. iv) High-norm tokens are triggered by the leading right singular vector of the matrix approximating the corresponding modules. We showcase two practical applications of these findings: the improvement of quantization schemes and the design of LLM signatures. Our findings not only advance the understanding of singular defects in LLMs but also open new avenues for their application. We expect that this work will stimulate further research into the internal mechanisms of LLMs. Code is released at https://github.com/haoqiwang/singular_defect.
Lay Summary: Large Language Models (LLMs) are having a pronounced impact on society, so understanding their internal mechanisms is very important. We advance the understanding of the high-norm phenomenon that appears in almost all contemporary LLMs. In the forward pass, the norms of the intermediate activations are not uniform across tokens. Rather, the norms of a few tokens are far higher than those of the other tokens. Interestingly, all these high-norm tokens point in the same direction, regardless of what text the tokens represent, which layer they occur in, and where they are located within the text sequence. A minimal sketch of how this can be observed is given below.
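The following sketch (not the paper's exact protocol) shows one way to observe the phenomenon with a HuggingFace-style causal LM: the model name, prompt, and choice of layer are illustrative assumptions; any recent LLM with `output_hidden_states` support would do.

```python
# Hedged sketch: inspect per-token hidden-state norms across layers and
# compare the directions of the highest-norm tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed; substitute any recent causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states: tuple of (num_layers + 1) tensors of shape [1, seq_len, dim]
for layer_idx, h in enumerate(out.hidden_states):
    norms = h[0].norm(dim=-1)  # per-token L2 norm at this layer
    print(f"layer {layer_idx:2d}: max norm {norms.max().item():10.1f}, "
          f"median norm {norms.median().item():8.1f}")

# Compare the directions of the two largest-norm tokens at a deep layer:
h = out.hidden_states[-2][0]
top2 = h.norm(dim=-1).topk(2).indices
cos = torch.nn.functional.cosine_similarity(h[top2[0]], h[top2[1]], dim=0)
print("cosine similarity between the two highest-norm tokens:", cos.item())
```

With models that exhibit the phenomenon, the maximum norm jumps by orders of magnitude at a specific layer, and the cosine similarity between high-norm tokens is close to one.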
We give a mathematical description covering the full lifecycle of the high-norm phenomenon. (1) We distinguish two types of high-norm tokens, the initial high-norm token and the noninitial high-norm token, and explain the differences in their explosion paths. (2) We describe how the explosion of norms happens by introducing the concept of the explosion subspace. (3) We accurately predict the direction of the high-norm tokens by studying the linear approximation of the transformer layers. (4) We explain the decay of norms through the negative eigenvalues of a layer. A toy sketch of this linear-approximation analysis is shown below.
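The sketch below illustrates the general idea, not the paper's exact construction: a small toy residual block stands in for a transformer decoder layer, the layer is linearized via its Jacobian at a reference input, and the leading right singular vector and the eigenvalues of that linear map are inspected.

```python
# Hedged sketch: linearize a toy layer and examine its singular vectors and eigenvalues.
import torch

torch.manual_seed(0)
dim = 64  # toy hidden size; real LLMs use thousands of dimensions

# Toy residual block x + MLP(LayerNorm(x)), loosely mimicking a decoder layer.
ln = torch.nn.LayerNorm(dim)
up = torch.nn.Linear(dim, 4 * dim)
down = torch.nn.Linear(4 * dim, dim)

def block(x):
    return x + down(torch.nn.functional.gelu(up(ln(x))))

x0 = torch.randn(dim)  # a reference input token

# Linear approximation of the layer at x0.
J = torch.autograd.functional.jacobian(block, x0)  # shape [dim, dim]

# The leading right singular vector is the input direction that the
# linearized layer amplifies most strongly.
U, S, Vh = torch.linalg.svd(J)
print("largest singular value:", S[0].item())
print("leading right singular vector (first 5 dims):", Vh[0, :5])

# Eigenvalues with negative real part indicate directions the layer damps,
# the mechanism the summary associates with the decay of norms.
eigvals = torch.linalg.eigvals(J)
print("eigenvalues with negative real part:", int((eigvals.real < 0).sum()))
```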
Our insights lead to practical applications. First, we show how the high-norm phenomenon affects LLM quantization and propose a simple fix. Second, we design an LLM signature that can be used to trace model lineage: it distinguishes whether an LLM was fine-tuned from another model and thereby detects model infringement (a toy sketch follows below). Ultimately, we believe that understanding singular defects will not only stimulate novel applications but also spur new insights into the internal mechanisms of LLMs.
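The following is a hypothetical sketch, not the paper's exact signature scheme: it assumes per-layer high-norm directions have already been extracted from two models of the same size as unit-norm tensors `dirs_a` and `dirs_b`, and compares them by their mean absolute per-layer cosine similarity.

```python
# Hedged sketch: compare per-layer defect directions to probe model lineage.
import torch

def signature_similarity(dirs_a: torch.Tensor, dirs_b: torch.Tensor) -> torch.Tensor:
    """Mean absolute per-layer cosine similarity between two [num_layers, dim] direction sets."""
    cos = torch.nn.functional.cosine_similarity(dirs_a, dirs_b, dim=-1)
    return cos.abs().mean()

# Toy usage: a fine-tuned model is expected to keep directions close to its base,
# while an independently trained model should not.
num_layers, dim = 32, 4096
base = torch.nn.functional.normalize(torch.randn(num_layers, dim), dim=-1)
finetuned = torch.nn.functional.normalize(base + 0.05 * torch.randn(num_layers, dim), dim=-1)
unrelated = torch.nn.functional.normalize(torch.randn(num_layers, dim), dim=-1)

print("base vs fine-tuned:", signature_similarity(base, finetuned).item())  # near 1
print("base vs unrelated: ", signature_similarity(base, unrelated).item())  # near 0
```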
Link To Code: https://github.com/haoqiwang/singular_defect
Primary Area: Deep Learning->Large Language Models
Keywords: Machine Learning, ICML, Singular Defect, Large Language Models, High-Norm Tokens, LLM Quantization, LLM Signature
Submission Number: 935