Position Augmentation: Reducing RoPE Extrapolation Cliffs via Random Position Scaling During Training

Zacharie Bugaud

Published: 26 May 2026, Last Modified: 31 May 2026

Venue: ICML 2026 FoGen Workshop Poster

License: CC BY 4.0

Keywords: Position Encoding, RoPE, Length Generalization, Transformers, Long Context

TL;DR: Multiplying RoPE position indices by a random scalar during training reduces the extrapolation cliff by 14-43x at 42M-113M Chinchilla-ratio scales with <=1.4% in-distribution penalty and <0.5% wall-clock overhead.

Abstract: Transformer language models with Rotary Position Embeddings (RoPE) suffer significant performance degradation when evaluated on sequences longer than their training context window. We propose Position Augmentation (PosAug), a training-time intervention that multiplies all position indices by a random scalar $\alpha \sim U[a,b]$ at each gradient step. Unlike randomized position encodings, PosAug preserves uniform spacing between adjacent tokens while exposing the model to a range of effective RoPE frequency scales. At Chinchilla-ratio training budgets ($\sim$20 tokens/parameter) with $n=3$ seeds, PosAug reduces the extrapolation cliff by $43\times$ at 42M parameters (cliff $2.65 \to 0.06$) and $14\times$ at 113M ($2.84 \to 0.20$), with $\leq 1.4\%$ in-distribution penalty and $<0.5\%$ wall-clock overhead. The method composes with inference-time scaling (PosAug+YaRN: $G(16\text{K})=2.99$ at 1B, though this model is undertrained and $G(C)$ may be inflated by the sliding-window evaluation approximation) and with longer context windows (PosAug $w=4096$: cliff $0.06$). The positional range exposed during training follows the approximate heuristic $L_{\max} \approx b \times w$. In a separate ablation setting, a wider step curriculum further reduces the cliff compared with standard PosAug. Preliminary 425M experiments (1B tokens, curriculum variant) show $4.9\times$ cliff reduction from baseline with $1.2\%$ penalty, though these are not Chinchilla-ratio. The residual cliff is higher at 113M (0.20) than at 42M (0.06) under Chinchilla-ratio training, so we do not yet know whether PosAug continues to hold up at larger scales.
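As a rough illustration of the mechanism described in the abstract, the following is a minimal sketch of position scaling applied before the RoPE angles are computed. It is not the author's implementation: the function names, the RoPE base of 10000, and the example range $[a,b]=[1,4]$ are assumptions chosen for illustration, since the abstract does not state them. The sketch draws one $\alpha$ per call (standing in for one gradient step), so adjacent tokens remain uniformly spaced, $\alpha$ apart.

```python
import math
import random


def rope_angles(positions, head_dim, base=10000.0):
    """Rotation angles: one row per position, one column per frequency pair.

    base=10000.0 is an assumed default, not a value taken from the abstract.
    """
    inv_freq = [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
    return [[p * f for f in inv_freq] for p in positions]


def posaug_positions(seq_len, a, b, rng):
    """Scale positions 0..seq_len-1 by a single alpha drawn from U[a, b].

    Hypothetical helper: one alpha per call, so spacing between adjacent
    tokens stays uniform (equal to alpha).
    """
    alpha = rng.uniform(a, b)
    return alpha, [alpha * t for t in range(seq_len)]


def apply_rope(vec, angles):
    """Rotate consecutive pairs (vec[2i], vec[2i+1]) by angles[i]."""
    out = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x * c - y * s, x * s + y * c])
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    # a=1.0 and b=4.0 are illustrative values only.
    alpha, positions = posaug_positions(seq_len=8, a=1.0, b=4.0, rng=rng)
    angles = rope_angles(positions, head_dim=4)
    rotated = apply_rope([1.0, 0.0, 0.0, 1.0], angles[-1])
    print(alpha, positions[-1], rotated)
```

Under this sketch the largest position seen in a step is $\alpha (w-1) \leq b\,(w-1)$ for a window of $w$ tokens, which is consistent with the abstract's approximate heuristic $L_{\max} \approx b \times w$.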

Submission Number: 72
