Abstract: Vision transformers have recently generated significant interest in the computer vision and audio communities due to their flexibility in learning long-range relationships. However, transformers are known to be data hungry, requiring orders of magnitude more data to train [1]. This has motivated research into self-supervised pretraining of audio transformers, which reduces the dependency on large amounts of labeled data and focuses on extracting concise representations of audio spectrograms. In this paper, we propose Audio-GMML, a self-supervised transformer for general audio representations based on Group Masked Model Learning (GMML) combined with a patch aggregation strategy that improves the quality of the learned representations and enforces the global structure of the input audio. We evaluate our pretrained models on several downstream tasks, setting a new state of the art on five audio and speech classification tasks. The code and pretrained weights will be made publicly available to the scientific community.
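To make the GMML idea concrete, the following is a minimal sketch of group-masked patch selection on a spectrogram token grid: instead of masking patches independently at random, connected blocks of neighboring time-frequency patches are masked together, which forces the model to reconstruct local structure from surrounding context. The function name, block size, and grid layout here are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def group_mask(num_time, num_freq, mask_ratio=0.5, group=4, seed=None):
    """Mask connected groups of spectrogram patches (GMML-style sketch).

    Masks `group x group` blocks of patch tokens until roughly
    `mask_ratio` of the grid is covered. All names and defaults are
    illustrative assumptions, not the paper's API.
    """
    g = torch.Generator()
    if seed is not None:
        g.manual_seed(seed)
    mask = torch.zeros(num_time, num_freq, dtype=torch.bool)
    target = int(mask_ratio * num_time * num_freq)
    while mask.sum() < target:
        # Pick a random top-left corner; the block is clipped at the
        # grid boundary by standard slicing.
        t = torch.randint(0, num_time, (1,), generator=g).item()
        f = torch.randint(0, num_freq, (1,), generator=g).item()
        mask[t:t + group, f:f + group] = True
    return mask.flatten()  # one boolean flag per patch token

# Example: a 64 x 8 grid of time x frequency patches, ~50% masked in blocks
m = group_mask(64, 8, mask_ratio=0.5, group=4, seed=0)
print(m.sum().item(), "of", m.numel(), "patches masked")
```

Masking contiguous groups (rather than scattered individual patches) removes whole local regions of the spectrogram, so trivial interpolation from immediate neighbors is insufficient and the model must rely on longer-range context.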