Keywords: attention mechanism, transformers, position bias, positional encoding, deep learning theory
TL;DR: We show that causal masking biases attention toward earlier tokens as layers deepen, while relative positional encodings balance distance-based decay with early-position dominance, providing deeper insights into position biases in transformers.
Abstract: Recent studies have revealed various manifestations of position bias in transformer architectures, from the “lost-in-the-middle” phenomenon to attention sinks, yet a comprehensive theoretical understanding of how attention masks and positional encodings shape these biases remains elusive. This paper presents a graph-theoretic framework for analyzing position bias in multi-layer attention. Modeling attention masks as directed graphs, we quantify how tokens interact with contextual information based on their sequential positions. We uncover two key insights. First, causal masking inherently biases attention toward earlier positions, as tokens in deeper layers attend to increasingly contextualized representations of earlier tokens. Second, we characterize the competing effects of the causal mask and relative positional encodings, such as the decay mask and rotary positional encoding (RoPE): while both mechanisms introduce distance-based decay within individual attention maps, their aggregate effect across multiple attention layers, coupled with the causal mask, leads to a trade-off between long-term decay and the cumulative importance of early sequence positions. Through controlled numerical experiments, we not only validate our theoretical findings but also reproduce position biases observed in real-world LLMs. Our framework offers a principled foundation for understanding positional bias in transformers, shedding light on the complex interplay of attention-mechanism components and guiding more informed architectural design.
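The interaction the abstract describes, cumulative early-position bias from causal masking versus distance-based decay within each attention map, can be illustrated with a toy numerical sketch. The snippet below is not taken from the submission: the random logits, the ALiBi-style linear distance penalty standing in for the decay mask, and the product-of-attention-maps proxy for multi-layer influence are all illustrative assumptions.

```python
import numpy as np

def causal_attention(n, rng, decay=0.0):
    """One layer of row-stochastic causal attention over n tokens,
    optionally with a linear distance penalty (decay > 0)."""
    logits = rng.normal(size=(n, n))
    dist = np.arange(n)[:, None] - np.arange(n)[None, :]   # i - j; >= 0 for past tokens
    logits = logits - decay * np.maximum(dist, 0)           # distance-based decay
    logits = np.where(dist >= 0, logits, -np.inf)           # causal mask: attend to j <= i only
    logits -= logits.max(axis=-1, keepdims=True)            # numerically stable softmax
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

def effective_attention(n_layers, n, decay=0.0, seed=0):
    """Product of per-layer attention maps: a rough proxy for how much the
    final-layer representation of each token draws on each input position."""
    rng = np.random.default_rng(seed)
    acc = np.eye(n)
    for _ in range(n_layers):
        acc = causal_attention(n, rng, decay) @ acc
    return acc

n = 16
for decay in (0.0, 0.5):
    for n_layers in (1, 4, 12):
        w = effective_attention(n_layers, n, decay)[-1]     # last token's effective weights
        print(f"layers={n_layers:2d}, decay={decay}: mass on first 4 tokens = {w[:4].sum():.2f}")
```

In this toy setting, the earliest positions tend to accumulate effective weight as depth grows (the first token can only attend to itself, so it acts as a sink), while a positive decay pulls each individual map toward recent tokens; the tension between the two mirrors the trade-off stated in the abstract.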
Code: zip
Submission Number: 62