SWAN-GPT: An Efficient and Scalable Approach for Long-Context Language Modeling

Published: 10 Jun 2025 · Last Modified: 10 Jun 2025 · LCFM 2025 · CC BY 4.0
Keywords: Transformer, LLM, Long-context, RoPE, Sliding window
Abstract: We present SWAN-GPT, a decoder-only Transformer architecture that generalizes to sequence lengths substantially longer than those seen during training. SWAN-GPT interleaves layers without positional encodings (NoPE) and sliding-window attention layers with rotary positional encodings (SWA-RoPE). Our experiments demonstrate strong performance on sequences significantly longer than the training length without specialized long-context training. This robust length extrapolation is achieved through our novel architecture, enhanced by dynamic scaling of attention scores during inference. Additionally, SWAN-GPT is more computationally efficient than standard GPT architectures, and existing pre-trained models can be converted to the SWAN architecture with minimal continued training.
Submission Number: 23
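The abstract describes an interleaving of global NoPE attention layers with sliding-window RoPE attention layers, plus dynamic scaling of attention scores at inference. The sketch below illustrates that layering pattern under stated assumptions; it is not the paper's implementation. The 1:1 interleaving ratio, the hyperparameters, the logarithmic score-scaling rule, and all class/function names (rope, AttentionLayer, SwanBlockStack) are illustrative choices, not details taken from the paper.

```python
import math
import torch
import torch.nn as nn


def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary positional embeddings to x of shape (B, H, T, D)."""
    B, H, T, D = x.shape
    half = D // 2
    freqs = base ** (-torch.arange(0, half, device=x.device, dtype=x.dtype) / half)
    angles = torch.arange(T, device=x.device, dtype=x.dtype)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class AttentionLayer(nn.Module):
    """Causal self-attention: windowed with RoPE (SWA-RoPE) or global without positions (NoPE)."""

    def __init__(self, dim: int, n_heads: int, window: int | None):
        super().__init__()
        self.n_heads, self.head_dim, self.window = n_heads, dim // n_heads, window
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor, score_scale: float = 1.0) -> torch.Tensor:
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (B, T, self.n_heads, self.head_dim)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))
        if self.window is not None:  # SWA-RoPE layer: rotary encodings on q/k
            q, k = rope(q), rope(k)
        # Attention scores, optionally rescaled at inference time.
        att = (q @ k.transpose(-2, -1)) * score_scale / math.sqrt(self.head_dim)
        i = torch.arange(T, device=x.device)
        mask = i[None, :] > i[:, None]                        # causal mask
        if self.window is not None:
            mask |= (i[:, None] - i[None, :]) >= self.window  # sliding-window mask
        att = att.masked_fill(mask, float("-inf")).softmax(dim=-1)
        out = (att @ v).transpose(1, 2).reshape(B, T, -1)
        return self.proj(out)


class SwanBlockStack(nn.Module):
    """Alternate NoPE (global, no positional encoding) and SWA-RoPE (windowed) layers."""

    def __init__(self, dim=256, n_heads=4, n_layers=8, window=128, train_len=1024):
        super().__init__()
        self.train_len = train_len
        self.layers = nn.ModuleList(
            AttentionLayer(dim, n_heads, window=None if i % 2 == 0 else window)
            for i in range(n_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        # Hypothetical dynamic scaling rule: grow attention scores logarithmically
        # once the sequence exceeds the training length (an assumption, not the
        # paper's formula).
        scale = max(1.0, math.log(T) / math.log(self.train_len))
        for layer in self.layers:
            x = x + layer(x, score_scale=scale)
        return x
```

Under these assumptions, the stack can be run on inputs longer than train_len: the SWA-RoPE layers only ever attend within a fixed window, so their positional encodings never leave the trained range, while the NoPE layers attend globally but carry no positional encoding to extrapolate. For example, SwanBlockStack()(torch.randn(1, 4096, 256)) processes a sequence four times the assumed training length.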