Keywords: Transformer, LLM, Long-context, RoPE, Sliding window
Abstract: We present SWAN-GPT, a decoder-only Transformer architecture that generalizes to sequence lengths substantially longer than those seen during training. SWAN-GPT interleaves layers without positional encodings (NoPE) and sliding-window attention layers with rotary positional encodings (SWA-RoPE). Our experiments demonstrate strong performance on sequences significantly longer than the training length without specialized long-context training. This robust length extrapolation is achieved through our novel architecture, enhanced by dynamic scaling of attention scores during inference. Additionally, SWAN-GPT is more computationally efficient than standard GPT architectures, and existing pre-trained models can be efficiently converted to the SWAN architecture with minimal continued training.
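The abstract does not include code; the following is a minimal, hypothetical sketch (not the authors' implementation) of the interleaving pattern it describes: global attention layers without positional encodings (NoPE) alternating with sliding-window attention layers that use RoPE, plus an assumed log-length rule for the "dynamic scaling of attention scores" at inference. Layer sizes, the window length, the alternation pattern, and the exact scaling rule are all illustrative assumptions.

```python
# Hypothetical sketch of interleaved NoPE / SWA-RoPE attention layers.
# The dynamic attention-score scaling rule below is an assumption, not
# necessarily the one used in SWAN-GPT.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def rotary(x, base=10000.0):
    """Apply rotary positional embeddings to (batch, heads, seq, dim)."""
    b, h, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, dtype=x.dtype) / half)
    angles = torch.arange(t, dtype=x.dtype)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class Attention(nn.Module):
    """Causal self-attention; optionally RoPE + a sliding-window mask."""

    def __init__(self, dim, heads, use_rope, window=None, train_len=2048):
        super().__init__()
        self.heads, self.use_rope, self.window = heads, use_rope, window
        self.train_len = train_len
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (b, t, self.heads, d // self.heads)
        q, k, v = (z.reshape(shape).transpose(1, 2) for z in (q, k, v))
        if self.use_rope:
            q, k = rotary(q), rotary(k)
        # Assumed dynamic scaling: temper attention scores once the
        # sequence exceeds the training length (exact rule is a guess).
        if t > self.train_len:
            q = q * (math.log(t) / math.log(self.train_len))
        # Causal mask, optionally restricted to a sliding window.
        i = torch.arange(t)
        mask = i[None, :] <= i[:, None]
        if self.window is not None:
            mask &= (i[:, None] - i[None, :]) < self.window
        out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
        return self.out(out.transpose(1, 2).reshape(b, t, d))


# Illustrative interleaving: even layers are global NoPE,
# odd layers are sliding-window attention with RoPE.
layers = nn.ModuleList(
    Attention(dim=256, heads=4, use_rope=(i % 2 == 1),
              window=512 if i % 2 == 1 else None)
    for i in range(8)
)
```

In this sketch, only the SWA-RoPE layers see positional information, while the NoPE layers attend globally; the intent is that neither layer type is tied to absolute positions beyond the training length, which is how the abstract motivates length extrapolation.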
Submission Number: 23