The Hidden Attention of Mamba Models

Published: 01 Jan 2024 · Last Modified: 30 Sep 2024 · CoRR 2024 · License: CC BY-SA 4.0
Abstract: The Mamba layer offers an efficient selective state space model (SSM) that is highly effective at modeling multiple domains, including NLP, long-range sequence processing, and computer vision. Selective SSMs are typically viewed as dual models: they are trained in parallel over the entire sequence via an IO-aware parallel scan and deployed autoregressively at inference time. We add a third view and show that such models can be viewed as attention-driven models. This new perspective enables us to empirically and theoretically compare their underlying mechanisms to those of the self-attention layers in transformers, and allows us to peer inside the inner workings of the Mamba model with explainability methods. Our code is publicly available.
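The attention view described in the abstract can be illustrated by unrolling a selective SSM recurrence: because the (diagonal) state transition, input projection, and output projection are data-dependent, the recurrence h_t = Ā_t h_{t-1} + B̄_t x_t, y_t = C_t h_t can be expanded into y_t = Σ_s α_{t,s} x_s, where the lower-triangular matrix α plays the role of attention scores. The sketch below is a minimal, single-channel NumPy illustration of this general idea under assumed shapes; the function name `hidden_attention_matrix` and the arrays `A_bar`, `B_bar`, `C` are illustrative and not the paper's actual code or notation.

```python
import numpy as np

def hidden_attention_matrix(A_bar, B_bar, C):
    """Unroll a single-channel selective SSM into its implicit attention matrix.

    A_bar: (L, N) discretized, input-dependent diagonal state transitions
    B_bar: (L, N) discretized, input-dependent input projections
    C:     (L, N) input-dependent output projections
    Returns alpha of shape (L, L), lower-triangular, such that
    y_t = sum_s alpha[t, s] * x_s reproduces the recurrent scan.
    """
    L, N = A_bar.shape
    alpha = np.zeros((L, L))
    for t in range(L):
        prod = np.ones(N)  # running product A_bar[s+1] * ... * A_bar[t] (elementwise, diagonal A)
        for s in range(t, -1, -1):
            alpha[t, s] = C[t] @ (prod * B_bar[s])
            prod = prod * A_bar[s]  # extend the product before moving to s-1
    return alpha

# Sanity check: the implicit attention matrix matches the recurrent scan.
L, N = 6, 4
rng = np.random.default_rng(0)
A_bar = rng.uniform(0.5, 1.0, (L, N))
B_bar = rng.normal(size=(L, N))
C = rng.normal(size=(L, N))
x = rng.normal(size=L)

h = np.zeros(N)
y_scan = []
for t in range(L):
    h = A_bar[t] * h + B_bar[t] * x[t]  # h_t = A_bar_t * h_{t-1} + B_bar_t * x_t
    y_scan.append(C[t] @ h)             # y_t = C_t h_t

alpha = hidden_attention_matrix(A_bar, B_bar, C)
assert np.allclose(y_scan, alpha @ x)
```

Because α_{t,s} depends on the input through Ā, B̄, and C, it is a data-dependent mixing matrix analogous to a (causal, unnormalized) attention map, which is what makes transformer-style explainability tools applicable.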