Length-Induced Embedding Collapse in Transformer-based Models

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Embedding Models, Length Collapse, Mechanistic Interpretability
TL;DR: This paper identifies a phenomenon called "Length Collapse," where text embeddings degrade in performance on long texts due to excessive low-pass filtering in the self-attention mechanism.
Abstract: Text embeddings enable various applications, but their performance deteriorates on longer texts. In this paper, we find that the performance degradation is due to a phenomenon called \textbf{Length Collapse}, where longer text embeddings collapse into a narrow space. This collapse results in a distributional inconsistency between embeddings of different text lengths, ultimately hurting the performance of downstream tasks. Theoretically, by considering that the self-attention mechanism inherently functions as a low-pass filter, we prove that longer sequences increase the attenuation rate of this low-pass filtering effect. As layers go deeper, excessive low-pass filtering causes the token signals to retain only their Direct-Current (DC) component, which means the input token feature maps collapse into a narrow space, especially for long texts. Based on the above analysis, we propose to mitigate the undesirable length collapse by introducing a temperature into $\mathrm{softmax}(\cdot)$, which attenuates the excessive low-pass filtering effect more strongly. The tuning-free method, called \textbf{TempScale}, can be plugged into multiple transformer-based embedding models. Empirically, we demonstrate that TempScale improves existing embedding models, especially on long text inputs, bringing up to \textbf{0.53\%} performance gains on 40 datasets from the Massive Text Embedding Benchmark (MTEB) and \textbf{0.82\%} performance gains on 4 datasets from LongEmbed, which specifically focuses on long-context retrieval. The source code is available at \url{https://anonymous.4open.science/r/Length_Collapse-22D2}.
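To make the temperature idea concrete, below is a minimal NumPy sketch of temperature-scaled self-attention, not the authors' TempScale implementation: the function name `scaled_attention`, the placement of `tau` on the pre-softmax logits, the single-head setup, and the toy values (`n = 256`, `tau = 0.5`) are illustrative assumptions. It only demonstrates the general mechanism the abstract describes: a sharper softmax (here `tau < 1`) weakens the averaging (low-pass) effect, so output tokens stay farther from their mean (DC) component.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_attention(X, Wq, Wk, Wv, tau=1.0):
    """Single-head self-attention with a softmax temperature tau (illustrative sketch).

    tau = 1 recovers standard scaled dot-product attention; tau < 1 sharpens
    the attention distribution, so each output token averages over fewer
    neighbours and retains more of its non-DC (high-frequency) signal.
    """
    d = Wq.shape[1]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    logits = Q @ K.T / (np.sqrt(d) * tau)  # temperature rescales the logits
    A = softmax(logits, axis=-1)           # row-stochastic attention matrix
    return A @ V

# Toy comparison: measure how far output tokens sit from their mean (the DC
# component of the token feature map) with and without a sharper temperature.
rng = np.random.default_rng(0)
n, d = 256, 32
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
for tau in (1.0, 0.5):
    out = scaled_attention(X, Wq, Wk, Wv, tau=tau)
    spread = np.linalg.norm(out - out.mean(axis=0), axis=1).mean()
    print(f"tau={tau}: mean distance from DC component = {spread:.3f}")
```

Under these assumptions, the `tau = 0.5` run keeps a larger spread around the mean token than `tau = 1.0`, illustrating why rescaling the softmax can counteract the collapse of long-text embeddings toward their DC component.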
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9553