Understanding Input Selectivity in Mamba: Impact on Approximation Power, Memorization, and Associative Recall Capacity
TL;DR: We demystify the role of input selectivity in Mamba, investigating how it impacts the model's function approximation power, long-term memorization, and associative recall capabilities.
Abstract: State-Space Models (SSMs), and particularly Mamba, have recently emerged as a promising alternative to Transformers. Mamba introduces input selectivity to its SSM layer (S6) and incorporates convolution and gating into its block definition. While these modifications do improve Mamba's performance over its SSM predecessors, it remains largely unclear how Mamba leverages the additional functionalities provided by input selectivity, and how these interact with the other operations in the Mamba architecture. In this work, we demystify the role of input selectivity in Mamba, investigating its impact on function approximation power, long-term memorization, and associative recall capabilities.
In particular: (i) we prove that the S6 layer of Mamba can represent projections onto *Haar wavelets*, providing an edge over its Diagonal SSM (S4D) predecessor in approximating discontinuous functions commonly arising in practice; (ii) we show how the S6 layer can dynamically counteract memory decay; (iii) we provide analytical solutions to the MQAR associative recall task using the Mamba architecture with different mixers --- Mamba, Mamba-2, and S4D. We demonstrate the tightness of our theoretical constructions with empirical results on concrete tasks. Our findings offer a mechanistic understanding of Mamba and reveal opportunities for improvement.
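To make the notion of input selectivity concrete, below is a minimal numpy sketch of a diagonal selective-scan (S6-style) recurrence. The parameterization (softplus step size, input-dependent B and C, diagonal A with negative real part) follows the general Mamba recipe, but all variable names, shapes, and weight initializations here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def selective_scan(x, A, W_delta, W_B, W_C):
    """Minimal diagonal selective-scan (S6-style) recurrence.

    x: (T, D) input sequence; A: (D, N) diagonal state matrix (negative
    entries for stability). Unlike S4D, the step size delta_t and the
    matrices B_t, C_t are functions of the input x_t, which is what makes
    the scan "selective". All parameterizations here are illustrative.
    """
    T, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                               # hidden state
    ys = []
    for t in range(T):
        delta = np.log1p(np.exp(x[t] @ W_delta))       # softplus -> (D,) step sizes
        B = x[t] @ W_B                                 # (N,) input-dependent B_t
        C = x[t] @ W_C                                 # (N,) input-dependent C_t
        A_bar = np.exp(delta[:, None] * A)             # (D, N) discretized A
        h = A_bar * h + delta[:, None] * B[None, :] * x[t][:, None]
        ys.append(h @ C)                               # (D,) output y_t
    return np.stack(ys)                                # (T, D)

rng = np.random.default_rng(0)
T, D, N = 6, 4, 8
x = rng.standard_normal((T, D))
A = -np.exp(rng.standard_normal((D, N)))               # Re(A) < 0 keeps the scan stable
y = selective_scan(x, A,
                   rng.standard_normal((D, D)) * 0.1,
                   rng.standard_normal((D, N)) * 0.1,
                   rng.standard_normal((D, N)) * 0.1)
print(y.shape)  # (6, 4)
```

Because `delta`, `B`, and `C` vary with `x_t`, the effective decay `A_bar` changes per token; this is the mechanism the paper analyzes when showing how S6 can counteract memory decay, in contrast to S4D where these quantities are fixed.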
Lay Summary: Mamba is a new architecture capable of rivaling Transformers in language modelling. The main novelty of Mamba lies in the use of a mechanism called input selectivity for processing information. But why is it so effective? And is that the sole reason behind its performance?
In this work, we analyze the impact of input selectivity and of other components of Mamba on its expressive power. We do so in three ways: by describing what types of functions Mamba can approximate; by illustrating how Mamba can effectively memorize information; and by showing how its components can act in concert to solve some associative-recall tasks.
Overall, our work provides a better understanding of the inner workings of Mamba, and points to ways in which this architecture could be further improved.
Primary Area: Deep Learning->Theory
Keywords: Mamba, State Space Model, Input selectivity, Function Approximation, Memorization, MQAR
Submission Number: 11760