Efficient Fine-tuning of Audio Spectrogram Transformers via Soft Mixture of Adapters

Published: 02 Sept 2024, Last Modified: 14 Mar 2025 · Interspeech 2024 · CC BY 4.0
Abstract: Mixture of Experts (MoE) architectures have recently gained traction thanks to their ability to scale a model's capacity while keeping the computational cost affordable, leading to state-of-the-art results in numerous fields. While MoE has mostly been investigated for the pre-training stage, its use in parameter-efficient transfer learning (PETL) settings is underexplored. To narrow this gap, this paper attempts to demystify the use of MoE for PETL of Audio Spectrogram Transformers on audio and speech downstream tasks. Specifically, we propose Soft Mixture of Adapters (Soft-MoA). It exploits adapters as the experts and, leveraging the recent Soft MoE method, relies on a soft assignment between input tokens and experts to keep the computational time limited. Extensive experiments across 4 benchmarks demonstrate that Soft-MoA outperforms the single-adapter method and performs on par with the dense MoA counterpart. We finally present ablation studies on key elements of Soft-MoA. Our code is available at https://github.com/umbertocappellazzo/PETL_AST.
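To illustrate the soft token-to-expert assignment described above, below is a minimal PyTorch sketch of a Soft-MoA layer, assuming bottleneck adapters as experts and one slot per expert. Names such as SoftMoA, num_experts, and bottleneck_dim are illustrative and not the authors' actual API; see the linked repository for the official implementation.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Standard bottleneck adapter: down-project, non-linearity, up-project."""
    def __init__(self, dim: int, bottleneck_dim: int):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.act(self.down(x)))


class SoftMoA(nn.Module):
    """Soft MoE-style routing with adapters as experts (one slot per expert).

    Illustrative sketch, not the paper's exact implementation.
    """
    def __init__(self, dim: int, num_experts: int, bottleneck_dim: int):
        super().__init__()
        self.experts = nn.ModuleList(
            Adapter(dim, bottleneck_dim) for _ in range(num_experts)
        )
        # One learnable routing vector per slot (here: one slot per expert).
        self.phi = nn.Parameter(torch.randn(dim, num_experts) * dim ** -0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        logits = torch.einsum("btd,de->bte", x, self.phi)        # (b, t, e)
        dispatch = logits.softmax(dim=1)  # each slot is a convex mix of tokens
        combine = logits.softmax(dim=2)   # each token is a convex mix of slots
        slots = torch.einsum("bte,btd->bed", dispatch, x)        # (b, e, d)
        outs = torch.stack(
            [exp(slots[:, i]) for i, exp in enumerate(self.experts)], dim=1
        )                                                        # (b, e, d)
        return torch.einsum("bte,bed->btd", combine, outs)       # (b, t, d)
```

In a PETL setup, the Transformer backbone would typically stay frozen while only the adapter weights and the routing parameters (phi above) are updated, so every token is processed by a fixed, small number of slots regardless of the number of experts.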