Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation

ICLR 2026 Conference Submission 13056 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Positional Encoding, Large Language Model, Transformer, Long Context, Attention Mechanism
TL;DR: We introduce BAM, a Bayesian theoretical framework for positional encoding (PE) that interprets existing methods as prior distributions within Bayesian attention, enabling significant improvements in transformers' long-context performance.
Abstract: Transformer-based language models rely on positional encoding (PE) to handle token order and support context length extrapolation. However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims. We propose the Bayesian Attention Mechanism (BAM), a theoretical framework that formulates positional encoding as a prior within a probabilistic model. BAM unifies existing methods (e.g., NoPE and ALiBi) and motivates a new Generalized Gaussian positional prior that substantially improves long-context generalization. Empirically, BAM enables accurate information retrieval at $500\times$ the training context length, outperforming the previous state of the art in context-length generalization by more than $25\times$ in retrieval accuracy, while maintaining comparable perplexity and introducing minimal additional parameters.
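To make the abstract's framing concrete, the following is a minimal sketch (not the authors' code) of attention in which a positional prior enters as a log-prior bias on the logits. The generalized Gaussian form $\exp(-|i-j|^p / s)$ reduces to an ALiBi-style linear bias when $p=1$, to a Gaussian bias when $p=2$, and to NoPE when the prior is flat; the parameters `p` and `scale`, and all tensor shapes, are illustrative assumptions rather than the paper's actual parameterization.

```python
import torch
import torch.nn.functional as F

def attention_with_positional_prior(q, k, v, p=1.0, scale=1.0):
    """q, k, v: (seq_len, d). Scaled dot-product attention with a
    distance-based log-prior added to the logits (hypothetical sketch)."""
    seq_len, d = q.shape
    logits = q @ k.T / d**0.5                       # (seq_len, seq_len)

    # Relative distance |i - j| between query position i and key position j.
    pos = torch.arange(seq_len)
    dist = (pos[:, None] - pos[None, :]).abs().float()

    # Log of a generalized Gaussian positional prior:
    # log p(j | i) ∝ -|i - j|^p / scale. p=1 gives an ALiBi-like linear
    # bias; a flat prior (zero bias) corresponds to NoPE.
    log_prior = -dist.pow(p) / scale

    # Causal mask: position i attends only to positions j <= i.
    causal = torch.ones(seq_len, seq_len).tril().bool()
    logits = (logits + log_prior).masked_fill(~causal, float("-inf"))

    return F.softmax(logits, dim=-1) @ v

# Example usage with random inputs.
q = torch.randn(16, 32)
k = torch.randn(16, 32)
v = torch.randn(16, 32)
out = attention_with_positional_prior(q, k, v, p=2.0, scale=4.0)
```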
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13056