Interpreting the Repeated Token Phenomenon in Large Language Models

Published: 01 May 2025 · Last Modified: 23 Jul 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We identify why LLMs struggle to repeat the same word many times and propose a way to fix it.
Abstract: Large Language Models (LLMs), despite their impressive capabilities, often fail to accurately repeat a single word when prompted to do so, and instead output unrelated text. This unexplained failure mode represents a *vulnerability*, allowing even end users to diverge models away from their intended behavior. We aim to explain the causes of this phenomenon and link it to the concept of "attention sinks", an emergent LLM behavior crucial for fluency, in which the initial token receives disproportionately high attention scores. Our investigation identifies the neural circuit responsible for attention sinks and shows how long repetitions disrupt this circuit. We extend this finding to other non-repeating sequences that exhibit similar circuit disruptions. To address this, we propose a targeted patch that effectively resolves the issue without negatively impacting the overall performance of the model. This study provides a mechanistic explanation for an LLM vulnerability, demonstrating how interpretability can diagnose and address issues, and offering insights that pave the way for more secure and reliable models.
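A minimal sketch (not from the paper) of how this failure mode can be elicited: feed a small open model a long run of a single repeated word and inspect the continuation. The model choice ("gpt2"), the repeated word, and the prompt length are illustrative assumptions, not the paper's experimental setup.

```python
# Hypothetical reproduction sketch: elicit repeated-token divergence by
# giving the model a long run of one word and letting it continue.
# "gpt2" and the prompt are illustrative choices, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# A long run of the same token; the paper reports that models tend to
# diverge into unrelated text instead of continuing the repetition.
prompt = "poem " * 100
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    gen = model.generate(**inputs, max_new_tokens=50, do_sample=False)

# Decode only the newly generated tokens and check whether they keep
# repeating "poem" or drift into unrelated text.
continuation = tok.decode(gen[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(continuation)
```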
Lay Summary: Large Language Models (LLMs) often struggle to repeat a single word, and generate unrelated text instead. This "repeated token divergence" is a significant vulnerability, allowing models to deviate from their intended behavior. Our research explains this by linking it to "attention sinks", an LLM behavior crucial for fluency. We used mechanistic interpretability, analyzing the neural circuits underlying attention sinks. We found a two-stage mechanism: an initial attention layer marks the first token, and a later neuron amplifies its hidden state, creating the attention sink. When repeated tokens are present, the first attention layer mistakenly marks both the initial token and subsequent identical tokens, leading to abnormally high attention and model divergence. This study offers a mechanistic explanation for an LLM vulnerability, demonstrating how interpretability can diagnose and address issues, paving the way for more secure and reliable models.
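The attention-sink behavior described above can be probed, in a simplified way, by measuring how much attention the final position pays to the first token under a normal prompt versus a repeated-token prompt. The sketch below is an assumption-laden illustration (Hugging Face `transformers`, "gpt2" as a stand-in model, last-position attention mass as the probe) rather than the authors' circuit analysis.

```python
# Minimal probing sketch (assumed setup, not the authors' code): compare the
# attention mass that the last position sends to token 0 for a normal prompt
# vs. a repeated-token prompt. High mass on token 0 is the "attention sink";
# under repetition, later copies of the token also attract high attention.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sink_profile(text):
    """Return, per layer, the mean attention of the final position to position 0."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)
    # out.attentions: one (batch, heads, seq_len, seq_len) tensor per layer
    return [attn[0, :, -1, 0].mean().item() for attn in out.attentions]

normal = sink_profile("The quick brown fox jumps over the lazy dog.")
repeated = sink_profile("poem " * 100)

for layer, (a, b) in enumerate(zip(normal, repeated)):
    print(f"layer {layer:2d}: normal={a:.3f}  repeated={b:.3f}")
```

In this simplified probe, a drop in first-token attention mass on the repeated prompt would be consistent with the circuit disruption described in the summary, though the paper's own analysis localizes the mechanism to specific layers and neurons.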
Primary Area: Deep Learning->Large Language Models
Keywords: Attention Sinks, LLM, Repeated tokens, First Token, LLMs, Privacy, Security
Submission Number: 4967