Abstract: Parameter-efficient transfer learning (PETL) methods have emerged as a solid alternative to the standard full fine-tuning approach. They train only a few extra parameters for each downstream task, without sacrificing performance and dispensing with the issue of storing a copy of the pre-trained model for each task. For audio classification tasks, the Audio Spectrogram Transformer (AST) model shows impressive results. However, surprisingly, how to efficiently adapt it to several downstream tasks has not been tackled before. In this paper, we bridge this gap and present a detailed investigation of common PETL methods for the adaptation of the AST model to audio/speech tasks. Furthermore, we propose a new adapter design that exploits the convolution module of the Conformer model, leading to superior performance over the standard PETL approaches and surpassing or achieving performance parity with full fine-tuning by updating only 0.29% of the parameters. Finally, we provide ablation studies revealing that our proposed adapter: 1) proves effective in few-shot efficient transfer learning, 2) attains optimal results regardless of the number of allocated parameters, and 3) can be applied to other pre-trained models. Our code is available at https://github.com/umbertocappellazzo/PETL_AST.
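As a rough illustration of the idea, the sketch below shows what a bottleneck adapter built around a Conformer-style convolution module might look like in PyTorch. This is a minimal sketch, not the implementation from the repository above: the bottleneck width, kernel size, normalization choices, and placement of the residual connection are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConformerConvAdapter(nn.Module):
    """Bottleneck adapter built around a Conformer-style convolution module.

    Illustrative sketch only: bottleneck width, kernel size, and
    normalization choices are assumptions, not the paper's exact design.
    """

    def __init__(self, dim: int = 768, bottleneck: int = 32, kernel_size: int = 31):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.down = nn.Linear(dim, bottleneck)  # down-projection to the bottleneck
        # Pointwise conv doubling the channels, followed by a GLU gate,
        # as in the Conformer convolution module
        self.pw1 = nn.Conv1d(bottleneck, 2 * bottleneck, kernel_size=1)
        self.glu = nn.GLU(dim=1)
        # Depthwise conv mixes information along the token (time/patch) axis
        self.dw = nn.Conv1d(bottleneck, bottleneck, kernel_size,
                            padding=kernel_size // 2, groups=bottleneck)
        self.bn = nn.BatchNorm1d(bottleneck)
        self.act = nn.SiLU()  # Swish activation
        self.pw2 = nn.Conv1d(bottleneck, bottleneck, kernel_size=1)
        self.up = nn.Linear(bottleneck, dim)  # up-projection back to model width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); the residual keeps the frozen backbone's
        # representation intact, so the adapter only learns a correction
        residual = x
        x = self.down(self.norm(x)).transpose(1, 2)  # (B, bottleneck, T)
        x = self.glu(self.pw1(x))                    # (B, bottleneck, T)
        x = self.act(self.bn(self.dw(x)))
        x = self.pw2(x).transpose(1, 2)              # (B, T, bottleneck)
        return residual + self.up(x)


# Quick shape check: 2 clips, 100 patch tokens, hidden size 768
adapter = ConformerConvAdapter()
print(adapter(torch.randn(2, 100, 768)).shape)  # torch.Size([2, 100, 768])
```

In a PETL setup, one such module would typically be inserted into each transformer block of the frozen backbone, with only the adapter parameters (a small fraction of the total) being updated during fine-tuning.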