How Training Window Length Shapes Neural Language Model Weights

Published: 11 Jun 2026, Last Modified: 12 Jun 2026Mech Interp Workshop ICML 2026 PosterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Feature Geometry, Applications of interpretability
TL;DR: Training identical language models across 10 context window lengths reveals a stable weight distance plateau, initialization-dependent displacement directions, transfer asymmetry, and null frequency condensation.
Abstract: We study how the training context window length $w$ is associated with changes in the learned weights of language models. By training identical architectures across three families (Transformer, 81.1M; GRU, 39.5M; RetNet, 41.5M) on ten window lengths ($w \in \{128, \ldots, 65536\}$) with a fixed 500M-token budget, we characterize the geometry of window-induced weight changes. Our primary finding: in a matched-step regime where all models receive identical optimizer steps and tokens per step, adjacent-window final-weight angular distance forms an approximate plateau ($\sim$0.175, corresponding to $\sim$31.5$^\circ$), spanning a $32\times$ range from $w=2048$ to $w=65{,}536$. This regularity is reproduced across two independent Transformer seeds with near-identical magnitudes. Two qualifications matter. Window-effect vectors (the weight displacement induced by doubling $w$) are completely orthogonal across seeds (cosine $\approx 0$): the magnitude of the window-induced change is reproducible, but its direction in weight space is initialization-dependent. Consecutive window doublings within the same seed produce anti-correlated displacement vectors (cosine $\approx -0.45$), though we show this is largely a geometric artifact of finite differences sharing a middle term. Functional validation via output KL divergence at multiple evaluation lengths provides evidence that adjacent-window models are moderately more similar than cross-seed models (sym-KL $\approx 0.27$ vs.\ $0.31$). Zero-shot weight transfer reveals strong asymmetry: short-to-long fails catastrophically while long-to-short degrades substantially. Brief fine-tuning (100 steps) closes the gap, suggesting the specialization is shallow in the loss landscape. RoPE frequency utilization is uniform across all windows in both final weights and learned updates ($N_{\text{eff}} \approx 32$ per head), providing no evidence for norm-level frequency condensation.
Submission Number: 107
Loading