Abstract: Vision transformers have recently generated significant interest in the computer vision and audio communities due to their flexibility in learning long-range relationships. However, transformers are known to be data hungry, requiring orders of magnitude more data to train [1]. This has motivated research into self-supervised pretraining of audio transformers, which reduces the dependency on large amounts of labeled data and focuses on extracting concise representations of audio spectrograms. In this paper, we propose Audio-GMML, a self-supervised transformer for general audio representations based on Group Masked Model Learning (GMML) combined with a patch aggregation strategy that improves the quality of the learned representations and enforces the global structure of the input audio. We evaluate our pretrained models on several downstream tasks, setting a new state of the art on five audio and speech classification tasks. The code and pretrained weights will be made publicly available to the scientific community.
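To make the GMML idea concrete, the following is a minimal sketch of group-masked patch selection on a spectrogram token grid: instead of masking patches independently at random, connected blocks of neighboring time-frequency patches are masked together, which forces the model to reconstruct local structure from surrounding context. The function name, block size, and grid layout here are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def group_mask(num_time, num_freq, mask_ratio=0.5, group=4, seed=None):
    """Mask connected groups of spectrogram patches (GMML-style sketch).

    Masks `group x group` blocks of patch tokens until roughly
    `mask_ratio` of the grid is covered. All names and defaults are
    illustrative assumptions, not the paper's API.
    """
    g = torch.Generator()
    if seed is not None:
        g.manual_seed(seed)
    mask = torch.zeros(num_time, num_freq, dtype=torch.bool)
    target = int(mask_ratio * num_time * num_freq)
    while mask.sum() < target:
        # Pick a random top-left corner; the block is clipped at the
        # grid boundary by standard slicing.
        t = torch.randint(0, num_time, (1,), generator=g).item()
        f = torch.randint(0, num_freq, (1,), generator=g).item()
        mask[t:t + group, f:f + group] = True
    return mask.flatten()  # one boolean flag per patch token

# Example: a 64 x 8 grid of time x frequency patches, ~50% masked in blocks
m = group_mask(64, 8, mask_ratio=0.5, group=4, seed=0)
print(m.sum().item(), "of", m.numel(), "patches masked")
```

Masking contiguous groups (rather than scattered individual patches) removes whole local regions of the spectrogram, so trivial interpolation from immediate neighbors is insufficient and the model must rely on longer-range context.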