On the Emergence of Position Bias in Transformers

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We show that causal masking biases attention toward earlier tokens as layers deepen, while relative positional encodings balance distance-based decay with early-position dominance, providing deeper insights into position biases in transformers.
Abstract: Recent studies have revealed various manifestations of position bias in transformer architectures, from the "lost-in-the-middle" phenomenon to attention sinks, yet a comprehensive theoretical understanding of how attention masks and positional encodings shape these biases remains elusive. This paper presents a graph-theoretic framework for analyzing position bias in multi-layer attention. Modeling attention masks as directed graphs, we quantify how tokens interact with contextual information based on their sequential positions. We uncover two key insights: First, causal masking inherently biases attention toward earlier positions, as tokens in deeper layers attend to increasingly more contextualized representations of earlier tokens. Second, we characterize the competing effects of the causal mask and relative positional encodings, such as the decay mask and rotary positional encoding (RoPE): while both mechanisms introduce distance-based decay within individual attention maps, their aggregate effect across multiple attention layers—coupled with the causal mask—leads to a trade-off between the long-term decay effects and the cumulative importance of early sequence positions. Through controlled numerical experiments, we not only validate our theoretical findings but also reproduce position biases observed in real-world LLMs. Our framework offers a principled foundation for understanding positional biases in transformers, shedding light on the complex interplay of attention mechanism components and guiding more informed architectural design.
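The multi-layer effect described in the abstract can be illustrated with a small numerical sketch. The snippet below is not the paper's code (see the repository linked under "Link To Code" for that); it is a minimal simulation under assumed settings: random, content-agnostic attention scores, a causal mask, and an ALiBi-style linear distance penalty standing in for the relative positional encodings discussed above. Multiplying the per-layer row-stochastic attention maps approximates how much each input position contributes to the last token's final representation.

```python
# Minimal sketch (assumptions noted above, not the paper's implementation):
# how causal masking alone can concentrate contribution on early positions as
# layers stack, and how a distance-based decay bias counteracts that pull.
import numpy as np

rng = np.random.default_rng(0)
n, n_layers = 16, 6  # sequence length, number of attention layers

def attention_matrix(decay_slope=0.0):
    """One layer's attention map: causal mask plus optional distance decay."""
    scores = rng.normal(size=(n, n))
    dist = np.arange(n)[:, None] - np.arange(n)[None, :]   # i - j for each (i, j)
    scores = scores - decay_slope * np.maximum(dist, 0)    # ALiBi-style penalty
    scores = np.where(dist >= 0, scores, -np.inf)          # causal mask
    scores = scores - scores.max(axis=1, keepdims=True)    # stable softmax
    probs = np.exp(scores)
    return probs / probs.sum(axis=1, keepdims=True)        # row-stochastic

for slope in (0.0, 1.0):
    # Product of per-layer attention maps: entry [i, j] approximates how much
    # input position j contributes to token i after all layers (value mixing only).
    P = np.eye(n)
    for _ in range(n_layers):
        P = attention_matrix(decay_slope=slope) @ P
    contrib = P[-1]  # contributions to the last token
    top = np.argsort(contrib)[::-1][:3]
    print(f"decay slope {slope}: top-3 source positions {top}, "
          f"weights {np.round(contrib[top], 3)}")
```

With no decay (slope 0.0), the accumulated contribution tends to tilt toward the earliest positions, consistent with the causal-mask bias the abstract describes; with a strong decay, mass shifts toward positions near the end of the sequence, illustrating the competing distance-based effect. The specific slope values and sequence length here are arbitrary choices for illustration.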
Lay Summary: Large language models (LLMs), like those behind chatbots and code assistants, often pay too much attention to the beginning or end of a passage while overlooking the middle—a phenomenon known as position bias. This behavior can reduce model reliability on tasks involving long documents, source code, or complex reasoning. In our work, we developed a new way to study this issue by modeling how information flows between words in a transformer—the backbone of most modern LLMs—as a graph. This graph-theoretic perspective revealed that two key components of transformers, the causal mask and positional encodings, push the model's attention in different directions. The causal mask steers focus toward earlier words, while positional encodings emphasize nearby words instead. Through theoretical analysis and controlled experiments, we show how these components interact to produce the position biases observed in real-world LLMs. Our framework provides a principled foundation for diagnosing, understanding, and mitigating these biases, paving the way for more balanced and reliable model behavior in future LLM designs.
Link To Code: https://github.com/xinyiwu98/position-bias-in-attention
Primary Area: Deep Learning->Attention Mechanisms
Keywords: attention mechanism, transformers, position bias, positional encoding, deep learning theory
Submission Number: 5569