Keywords: Transformers; Extrapolation; Structural attention
Abstract: Transformers form the backbone of modern large language models, but their long-context performance is limited by the dilution effect: attention mass spreads uniformly across distant positions, failing to maintain structural dependencies. Existing solutions, such as sparse or efficient attention patterns, improve efficiency but do not address the lack of structural anchoring. We introduce the Structural-Former (S-Former), which maintains a parallel structural stream that evolves recurrently to track sequential patterns independently of token content and provides structural anchors for attention. Unlike compressed state-space models, our approach maintains explicit structural representations that remain orthogonal to semantic content. We study two integration mechanisms: (i) attention fusion, which validates the decoupling principle by showing that the structural gate $\alpha_t$ tracks bracket depth in Dyck languages; and (ii) bias injection, a minimal and stable design that adds the structural signal to the hidden activations. Synthetic probes (Markov, Dyck, and JSON) demonstrate that the structural stream learns hierarchical and sequential rules beyond surface statistics. On WikiText-103, S-Former extrapolates stably to long contexts, reducing perplexity degradation by 76% when extrapolating to 40k tokens. These findings suggest that introducing a recurrent structural stream provides a lightweight and scalable inductive bias that substantially improves long-context extrapolation, offering a complementary direction to sparse attention or memory-based methods.
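Purely as an illustration of the bias-injection mechanism described in the abstract, the sketch below shows one way a content-independent recurrent structural stream could be added to hidden activations; the class names, the GRU cell, and all dimensions are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of a bias-injected structural stream (illustrative only;
# the GRU choice, names, and shapes are assumptions, not the S-Former implementation).
import torch
import torch.nn as nn

class StructuralStream(nn.Module):
    """Recurrent stream that evolves over positions, independent of token content."""
    def __init__(self, d_struct: int, d_model: int):
        super().__init__()
        # Content-independent input: the same learned "tick" embedding at every step,
        # so the stream's state depends only on sequence position, not token identity.
        self.tick = nn.Parameter(torch.randn(1, 1, d_struct) * 0.02)
        self.rnn = nn.GRU(d_struct, d_struct, batch_first=True)
        self.to_bias = nn.Linear(d_struct, d_model)

    def forward(self, batch: int, seq_len: int) -> torch.Tensor:
        ticks = self.tick.expand(batch, seq_len, -1)   # (batch, seq_len, d_struct)
        states, _ = self.rnn(ticks)                    # recurrent structural states
        return self.to_bias(states)                    # (batch, seq_len, d_model)

class BiasInjectedBlock(nn.Module):
    """Transformer block that adds the structural signal into hidden activations."""
    def __init__(self, d_model: int, n_heads: int, d_struct: int = 64):
        super().__init__()
        self.struct = StructuralStream(d_struct, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        x = x + self.struct(b, t)  # bias injection into hidden activations
        h = self.norm1(x)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), 1)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)
        x = x + attn_out
        x = x + self.ff(self.norm2(x))
        return x

# Usage on a toy hidden-state tensor.
block = BiasInjectedBlock(d_model=128, n_heads=4)
y = block(torch.randn(2, 16, 128))
print(y.shape)  # torch.Size([2, 16, 128])
```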
Supplementary Material: pdf
Primary Area: foundation or frontier models, including LLMs
Submission Number: 17278