Keywords: Attention, State Space Models, Memory, Kernel Methods
Abstract: Recent progress in deep learning is largely driven by advances in deep sequence models. Among them, Transformers and deep SSMs are arguably the two most successful architectures, yet they follow different designs: Transformers learn where to attend in the context via attention mechanisms, whereas deep SSMs compress and gate context information into fixed-size, long-range memory states. Hybrid architectures consisting of both attention and SSM layers can achieve superior performance because they address attention's quadratic scaling and its limitations in long-range memory. Instead of merely stacking these two types of layers, in this work we propose Interdomain Attention, which integrates SSMs naturally into attention modules through the lens of kernel methods and finite basis approximation. We argue that Interdomain Attention has the potential to switch or interpolate between the expressiveness and scaling behaviors of Transformers and deep SSMs, and preliminary experiments on sCIFAR-10 show promise.
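To make the kernel-method view in the abstract concrete, below is a minimal NumPy sketch (not the authors' implementation) of the general idea it points to: replace the softmax kernel with a finite feature map so attention can be computed by a linear recurrence over a fixed-size state, i.e. an SSM-style scan. The feature map `phi`, the per-feature decay `lam`, and the function names are illustrative assumptions, not details from the submission.

```python
import numpy as np

def phi(x):
    """Finite, positive feature map (elu + 1), a common kernel approximation."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def ssm_linear_attention(Q, K, V, lam):
    """Causal attention via a diagonal-decay recurrence over kernel features.

    Q, K: (T, d_k), V: (T, d_v), lam: (d_k,) per-feature decay in (0, 1].
    The state S has fixed size (d_k, d_v) regardless of sequence length T.
    """
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))   # running sum of phi(k_t) v_t^T, with decay
    z = np.zeros(d_k)          # running sum of phi(k_t), for normalization
    out = np.zeros((T, d_v))
    for t in range(T):
        fk = phi(K[t])
        S = lam[:, None] * S + np.outer(fk, V[t])
        z = lam * z + fk
        fq = phi(Q[t])
        out[t] = (fq @ S) / (fq @ z + 1e-6)
    return out

# Usage: with lam = 1 this reduces to standard linear (kernelized) attention;
# lam < 1 gives the exponentially decaying memory typical of diagonal SSMs.
T, d_k, d_v = 16, 8, 8
rng = np.random.default_rng(0)
Q = rng.standard_normal((T, d_k))
K = rng.standard_normal((T, d_k))
V = rng.standard_normal((T, d_v))
print(ssm_linear_attention(Q, K, V, lam=np.full(d_k, 0.9)).shape)  # (16, 8)
```

This is only one plausible instantiation of "attention through finite basis approximation plus an SSM recurrence"; the submission's Interdomain Attention may differ in how the basis and state dynamics are constructed.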
Submission Number: 8