Fourier Position Embedding: Enhancing Attention’s Periodic Extension for Length Generalization

Published: 01 May 2025, Last Modified: 18 Jun 2025. ICML 2025 poster. License: CC BY 4.0
Abstract: Extending the context length of Language Models (LMs) by improving Rotary Position Embedding (RoPE) has become a trend. While prior works mainly address RoPE's limitations within attention, this paper uncovers adverse effects on length generalization arising from nearly all parts of LMs. Using *Discrete Signal Processing* theory, we show that RoPE enables periodic attention by implicitly performing a *Non-Uniform Discrete Fourier Transform*. However, this periodicity is undermined by spectrum damage caused by: 1) linear layers and activation functions outside of attention; 2) insufficiently trained frequency components introduced by time-domain truncation. Building on these observations, we propose ***Fourier Position Embedding (FoPE)***, which enhances attention's frequency-domain properties to improve both its periodic extension and length generalization. FoPE constructs a *Fourier Series* and zeroes out the destructive frequency components, increasing the model's robustness against spectrum damage. Experiments across various model scales and benchmarks show that, across varying context windows, FoPE maintains more stable performance than other baselines. Additional analyses and ablations lend further support to our method and theoretical modeling.
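To make the abstract's two core ideas concrete, the sketch below illustrates them in PyTorch: each RoPE frequency is treated as the dominant term of a small Fourier series rather than a single sinusoid, and frequency components whose period exceeds the training context are zeroed out. This is a minimal sketch under stated assumptions, not the released implementation (see the repository linked below); the auxiliary-frequency sampling and the `n_extra` / `scale` hyperparameters are illustrative choices.

```python
# A minimal, illustrative sketch of the two ideas described in the abstract:
# (1) treat each RoPE frequency as the dominant term of a small Fourier series,
# (2) zero out frequency components whose period exceeds the training context,
#     since they are never observed over a full cycle during training.
# The auxiliary-frequency sampling and the hyperparameters `n_extra` / `scale`
# are assumptions made for demonstration, not the paper's released code.
import math
import torch


def fope_cos_sin(seq_len, head_dim, base=10000.0, train_len=512, n_extra=4, scale=0.02):
    half = head_dim // 2
    # Standard RoPE frequencies: omega_k = base^(-2k / head_dim)
    freqs = base ** (-torch.arange(half, dtype=torch.float32) * 2.0 / head_dim)

    # (2) Clip under-trained components: a frequency whose period 2*pi/omega is
    # longer than the training length is replaced by the zero frequency (a constant).
    freqs = torch.where(freqs < 2 * math.pi / train_len, torch.zeros_like(freqs), freqs)

    pos = torch.arange(seq_len, dtype=torch.float32)[:, None]   # (seq_len, 1)
    angle = pos * freqs[None, :]                                # (seq_len, half)
    cos, sin = torch.cos(angle), torch.sin(angle)

    # (1) Fourier-series view: mix in a few small auxiliary sinusoids per dimension.
    aux_freqs = torch.rand(n_extra, half) * freqs.max()         # illustrative sampling
    aux_coefs = scale * torch.randn(n_extra, half)              # small random weights
    aux_angle = pos[:, None, :] * aux_freqs[None, :, :]         # (seq_len, n_extra, half)
    cos = cos + (aux_coefs[None] * torch.cos(aux_angle)).sum(dim=1)
    sin = sin + (aux_coefs[None] * torch.sin(aux_angle)).sum(dim=1)
    return cos, sin  # used like standard RoPE tables to rotate query/key pairs


# Example: tables for a 1024-token sequence with 64-dimensional heads.
cos, sin = fope_cos_sin(seq_len=1024, head_dim=64)
print(cos.shape, sin.shape)  # torch.Size([1024, 32]) torch.Size([1024, 32])
```

The returned cos/sin tables would be applied exactly as in standard RoPE, rotating paired query/key features by the per-position angles; only how those tables are built differs in this sketch.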
Lay Summary: This paper tackles a key challenge in making language models (like ChatGPT) better at understanding and working with long pieces of text. Most current models struggle when they need to remember and process information that is far apart in the text, for example something mentioned in the first paragraph and then referred to much later. By borrowing ideas from signal processing (the science behind how we analyze sound waves or radio signals), the authors explain that RoPE works a bit like a radio signal: repeating patterns that help the model stay "in tune" with long text. But other parts of the model can damage this signal, making it harder for the model to perform well on long texts. To fix this, they propose a new method called ***Fourier Position Embedding (FoPE)***. Think of it as giving the model a clearer and more stable signal to follow by removing the parts that cause noise or confusion. This helps the model stay better at understanding connections across long stretches of text.
Link To Code: https://github.com/TsinghuaC3I/Fourier-Position-Embedding
Primary Area: Deep Learning->Large Language Models
Keywords: Position Embedding, Length Generalization, Length Extrapolation, Fourier Transform
Submission Number: 9420