Keywords: Transformer, LLM, Long-context, RoPE, Sliding window
Abstract: We present SWAN-GPT, a decoder-only Transformer architecture that generalizes to sequence lengths substantially longer than those seen during training. SWAN-GPT interleaves layers without positional encodings (NoPE) and sliding-window attention layers with rotary positional encodings (SWA-RoPE). Our experiments demonstrate strong performance on sequences significantly longer than the training length without specialized long-context training. This robust length extrapolation is achieved through our novel architecture, enhanced by dynamic scaling of attention scores during inference. Additionally, SWAN-GPT is more computationally efficient than standard GPT architectures, and existing pre-trained models can be efficiently converted to the SWAN architecture with minimal continued training.
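The abstract does not include code; the following is a minimal, hypothetical sketch (not the authors' implementation) of the interleaving pattern it describes: global attention layers without positional encodings (NoPE) alternating with sliding-window attention layers that use RoPE, plus an assumed log-length rule for the "dynamic scaling of attention scores" at inference. Layer sizes, the window length, the alternation pattern, and the exact scaling rule are all illustrative assumptions.

```python
# Hypothetical sketch of interleaved NoPE / SWA-RoPE attention layers.
# The dynamic attention-score scaling rule below is an assumption, not
# necessarily the one used in SWAN-GPT.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def rotary(x, base=10000.0):
    """Apply rotary positional embeddings to (batch, heads, seq, dim)."""
    b, h, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, dtype=x.dtype) / half)
    angles = torch.arange(t, dtype=x.dtype)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class Attention(nn.Module):
    """Causal self-attention; optionally RoPE + a sliding-window mask."""

    def __init__(self, dim, heads, use_rope, window=None, train_len=2048):
        super().__init__()
        self.heads, self.use_rope, self.window = heads, use_rope, window
        self.train_len = train_len
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (b, t, self.heads, d // self.heads)
        q, k, v = (z.reshape(shape).transpose(1, 2) for z in (q, k, v))
        if self.use_rope:
            q, k = rotary(q), rotary(k)
        # Assumed dynamic scaling: temper attention scores once the
        # sequence exceeds the training length (exact rule is a guess).
        if t > self.train_len:
            q = q * (math.log(t) / math.log(self.train_len))
        # Causal mask, optionally restricted to a sliding window.
        i = torch.arange(t)
        mask = i[None, :] <= i[:, None]
        if self.window is not None:
            mask &= (i[:, None] - i[None, :]) < self.window
        out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
        return self.out(out.transpose(1, 2).reshape(b, t, d))


# Illustrative interleaving: even layers are global NoPE,
# odd layers are sliding-window attention with RoPE.
layers = nn.ModuleList(
    Attention(dim=256, heads=4, use_rope=(i % 2 == 1),
              window=512 if i % 2 == 1 else None)
    for i in range(8)
)
```

In this sketch, only the SWA-RoPE layers see positional information, while the NoPE layers attend globally; the intent is that neither layer type is tied to absolute positions beyond the training length, which is how the abstract motivates length extrapolation.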
Submission Number: 23