TL;DR: We introduce a method for near-lossless LLM context window extension.
Abstract: LongRoPE2 is a novel approach that extends the effective context window of pre-trained large language models (LLMs) to the target length while preserving performance on the original, shorter context window.
This is achieved through three contributions: (1) a hypothesis that insufficient training in higher RoPE dimensions contributes to the persistent out-of-distribution (OOD) issues observed in existing methods; (2) an effective RoPE rescaling algorithm that adopts evolutionary search guided by "needle-driven" perplexity to address the insufficient-training problem; (3) a mixed context window training approach that fine-tunes model weights to adopt rescaled RoPE for long-context sequences while preserving short-context performance with the original RoPE.
Extensive experiments on LLaMA3-8B and Phi3-mini-3.8B across various benchmarks validate the hypothesis and demonstrate the effectiveness of LongRoPE2. Remarkably, LongRoPE2 extends LLaMA3-8B to achieve a 128K effective context length while retaining over 98.5% of short-context performance, using only 10B tokens -- 80x fewer than Meta's approach, which fails to reach the target effective context length.
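To make the rescaling idea concrete, here is a minimal sketch, not the paper's implementation: it divides the standard RoPE frequencies by per-dimension factors and switches between the original and rescaled factors depending on input length, mirroring the mixed context window described above. The 8192-token window, 128-dim heads, and linearly spaced factors are illustrative placeholders; LongRoPE2 obtains its factors via the needle-PPL-guided evolutionary search.

```python
import torch

def rope_inv_freq(head_dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE inverse frequencies: theta_i = base^(-2i/d) for each dimension pair i."""
    return base ** (-torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)

def rescaled_angles(positions: torch.Tensor, head_dim: int,
                    scale: torch.Tensor) -> torch.Tensor:
    """Rotation angles with a per-dimension-pair rescaling factor.

    Dividing theta_i by scale_i slows that dimension's rotation, keeping angles at
    long positions closer to the range seen during pre-training.
    """
    theta = rope_inv_freq(head_dim) / scale              # (head_dim // 2,)
    return positions.float()[:, None] * theta[None, :]   # (seq_len, head_dim // 2)

# Mixed context window idea: original RoPE (unit factors) for short inputs,
# rescaled RoPE for inputs longer than the pre-trained window.
ORIG_CTX = 8192                                      # placeholder original window
HEAD_DIM = 128                                       # placeholder head dimension
searched = torch.linspace(1.0, 16.0, HEAD_DIM // 2)  # placeholder for searched factors

def angles_for(positions: torch.Tensor) -> torch.Tensor:
    use_rescaled = positions.max().item() >= ORIG_CTX
    scale = searched if use_rescaled else torch.ones(HEAD_DIM // 2)
    return rescaled_angles(positions, HEAD_DIM, scale)
```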
Lay Summary: Large language models (LLMs) often fail when processing very long documents, as they were originally trained on much shorter sequences. Simply extending the context length often hurts performance or requires expensive retraining.
We found that these failures stem from how LLMs encode positional information using a method called RoPE, which struggles when scaled to longer inputs. To address this, we designed a more effective RoPE adjustment method using an evolutionary search guided by a “needle-in-a-haystack” benchmark. We also introduced a mixed training approach: the model uses the original RoPE for short sequences and rescaled RoPE for long ones, allowing it to retain strong performance across both.
Our method, LongRoPE2, enables LLaMA3-8B to process inputs up to 128,000 tokens while preserving over 98.5 percent of its original accuracy on short inputs. It achieves this using only 10 billion training tokens, 80 times fewer than Meta’s approach.
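As a rough illustration of the "needle-driven" perplexity signal that guides the search, the sketch below scores only the answer ("needle") tokens hidden deep in a long context, rather than every token. It assumes a Hugging Face-style causal LM that returns `.logits`; the function name and interface are illustrative, not the authors' code.

```python
import math
import torch

def needle_perplexity(model, tokens: torch.Tensor, needle_mask: torch.Tensor) -> float:
    """Perplexity restricted to 'needle' answer tokens placed deep in a long context.

    Full-sequence perplexity is dominated by easy local predictions, so it can look
    healthy even when long-range retrieval is broken; scoring only the needle tokens
    gives a signal that tracks the effective context length.
    Assumes a causal LM returning logits of shape (1, seq_len, vocab_size).
    """
    with torch.no_grad():
        logits = model(tokens[None, :]).logits[0, :-1]            # predictions for positions 1..T-1
        log_probs = torch.log_softmax(logits.float(), dim=-1)
        target_lp = log_probs.gather(-1, tokens[1:, None])[:, 0]  # log p(target_t | prefix)
        needle_lp = target_lp[needle_mask[1:]]                    # keep needle targets only
    return math.exp(-needle_lp.mean().item())

# A search over RoPE rescaling factors would average this score across held-out
# long documents for each candidate and keep the candidates with the lowest value.
```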
Primary Area: Deep Learning->Large Language Models
Keywords: Long-Context LLM, context window extension, RoPE
Submission Number: 6583