Abstract: Music Information Retrieval (MIR) tasks on raw audio have traditionally been tackled with convolutional neural networks (CNNs) and transformer-based models. While CNNs effectively capture local structure and transformers leverage attention for long-range dependencies, both architectures come with computational and scalability challenges. Recently, state-space models such as Mamba have gained popularity for long sequences such as audio, and have shown considerable promise compared to convolutional or transformer-based architectures. In this study, we introduce a novel extension of Mamba tailored to music. The resulting method, MoMamba (Music Oriented Mamba), is a lightweight Mamba-based music classification model. We evaluate MoMamba's performance across several benchmark MIR tasks. Our results show that MoMamba consistently outperforms a number of baselines, including an existing Mamba-based method, on all of the benchmark datasets we considered. Importantly, all models were trained from scratch without any pretraining, so the performance gains cannot be attributed to transfer learning. Additionally, MoMamba's performance rivals published results from models pretrained on much larger datasets. Our work highlights MoMamba's advantages in accuracy and inference time for music analysis and retrieval, encouraging further research into its capabilities within the MIR domain.
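To make the state-space idea behind Mamba-style audio models concrete, here is a minimal sketch of a diagonal linear state-space recurrence applied to a raw waveform. This is not the paper's MoMamba architecture (whose details are not given in the abstract); all function names, state sizes, and parameter values below are hypothetical illustrations of a generic SSM scan, assuming only NumPy.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Run a diagonal linear state-space recurrence over a 1-D signal.

    Per time step:  h_t = A * h_{t-1} + B * x_t   (elementwise state update)
                    y_t = C . h_t                  (scalar readout)
    A, B, C are vectors of length d (one value per state channel).
    """
    d = A.shape[0]
    h = np.zeros(d)
    ys = []
    for x_t in x:
        h = A * h + B * x_t   # decaying state accumulates the input history
        ys.append(C @ h)      # linear readout of the hidden state
    return np.array(ys)

# Toy classification head: scan the signal, mean-pool, take one logit.
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)   # one second of synthetic 16 kHz "audio"
A = np.full(8, 0.9)              # stable per-channel decay (|A| < 1)
B = 0.1 * rng.standard_normal(8)
C = rng.standard_normal(8)
features = ssm_scan(x, A, B, C)
logit = features.mean()
print(features.shape)
```

Because the recurrence is linear with a fixed per-channel decay, it processes arbitrarily long sequences in O(T) time with O(1) state, which is the scalability property the abstract contrasts against attention; Mamba additionally makes A, B, C input-dependent ("selective"), which this sketch omits for brevity.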
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Tianbao_Yang1
Submission Number: 8104