Keywords: Attention, State Space Models, Memory, Kernel Methods, Language Modeling
Abstract: Transformers and deep SSMs sit at opposite ends of a basic design choice: attention learns where to read through query-dependent matching, at quadratic cost and with a growing KV cache, while deep SSMs compress context into a fixed-size recurrent state at the cost of a query-independent readout. We propose Interdomain Attention, which integrates an SSM into an attention module through kernel methods: an attention kernel is approximated by a finite feature map, the resulting key features and values are projected onto a shared set of basis functions maintained by a single SSM recurrence, and each query selects a slice of the stored coefficients through its own feature map, recovering the query-dependent readout of attention with a fixed-size state. The scalable layer is a learned relaxation of this derivation, and we validate its components through ablations. In a 125M–1.3B autoregressive language-modeling study on FineWeb-Edu at a matched recurrent-state budget, Interdomain Attention improves on an SSM token mixer at every scale, surpasses a same-recipe softmax baseline at 1.3B on validation perplexity and on the eight-task commonsense suite, and inherits the length-flat behavior of its fixed-state core out to $3.5\times$ the training context. Ablations indicate that the query-dependent readout is the main source of the gain.
Submission Number: 131