Beyond the Efficiency-Performance Trade-off: Semantic Foundation Attention

ICLR 2026 Conference Submission 21475 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Semantic Foundation Attention, Attention Mechanism, Computational Efficiency, Semantic Reconstruction, Token Merging
Abstract: The quadratic computational complexity of self-attention presents a significant challenge for scaling Transformer architectures to longer sequences. While existing approaches pursue efficiency through sparse approximation or hardware optimization, they operate under the assumption that the input token sequence remains immutable. We propose Semantic Foundation Attention (SFA), which introduces semantic reconstruction, a paradigm that dynamically reconfigures the computational structure based on semantic relationships during attention computation. SFA employs two complementary strategies: similarity merging consolidates semantically aligned tokens through vector addition to preserve and amplify signal strength, while difference merging exploits orthogonality properties of high-dimensional embedding spaces to efficiently integrate complementary information. We implement custom CUDA kernels for SFA that decompose the resulting dynamic attention patterns into diagonal and rectangular computation domains, enabling efficient execution without explicitly storing the sparse matrix. Comprehensive evaluation on OLMoE architectures demonstrates that SFA consistently improves performance across multiple downstream benchmarks while reducing computational requirements. These results show that computational efficiency and model performance can be jointly optimized through semantically aware attention computation, establishing semantic reconstruction as a viable paradigm for attention mechanism design.
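The abstract's two merging strategies can be illustrated with a minimal sketch. The snippet below is an illustrative assumption, not the authors' implementation: the thresholds, the greedy adjacent-pair scheme, and the function name `merge_tokens` are hypothetical, and the real SFA operates inside attention with custom CUDA kernels rather than as a standalone preprocessing pass.

```python
# Hypothetical sketch of similarity merging and difference merging on a token
# sequence, written in PyTorch. Thresholds and the greedy neighbor-pairing
# scheme are assumptions for illustration only.
import torch
import torch.nn.functional as F

def merge_tokens(x: torch.Tensor,
                 sim_threshold: float = 0.8,
                 orth_threshold: float = 0.1) -> torch.Tensor:
    """Greedily merge adjacent token pairs in a (seq_len, dim) sequence.

    - similarity merging: highly aligned neighbors are summed (vector
      addition), preserving and amplifying the shared signal.
    - difference merging: near-orthogonal neighbors are also summed, relying
      on approximate orthogonality in high dimensions so both components
      remain largely recoverable; intermediate pairs are left unmerged.
    """
    merged, i = [], 0
    while i < x.size(0) - 1:
        cos = F.cosine_similarity(x[i], x[i + 1], dim=0).abs()
        if cos >= sim_threshold or cos <= orth_threshold:
            merged.append(x[i] + x[i + 1])   # merge the pair into one token
            i += 2
        else:
            merged.append(x[i])              # keep the token unchanged
            i += 1
    if i == x.size(0) - 1:
        merged.append(x[i])                  # trailing token, if any
    return torch.stack(merged)

# Example: a 128-token sequence shrinks wherever neighbors are strongly
# aligned or nearly orthogonal, shortening the sequence seen by attention.
tokens = torch.randn(128, 64)
print(merge_tokens(tokens).shape)
```

A shorter merged sequence is what allows the attention pattern to be decomposed into the diagonal and rectangular computation domains mentioned in the abstract, avoiding materialization of the full sparse attention matrix.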
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21475