- Keywords: Audio representations, self-supervised learning
- Abstract: We explore self-supervision as a way to learn general-purpose audio representations. Specifically, we propose self-supervised tasks that exploit the temporal context in the spectrogram domain. The temporal gap task estimates the distance between two short audio segments extracted at random from the same audio clip. The Audio2Vec task is inspired by Word2Vec, a popular technique used to learn word embeddings, and aims at reconstructing a spectrogram slice from past and future slices or, alternatively, at reconstructing the context of surrounding slices from the current slice. We evaluate the quality of the embeddings produced by the self-supervised models by measuring the accuracy of linear classifiers that take the embeddings as input and address a variety of downstream audio tasks. Our results show that the learned representations partially bridge the performance gap with fully supervised models of similar size, and for some tasks even approach their performance.
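To make the two pretext tasks in the abstract more concrete, the following is a minimal sketch of how training examples might be drawn from a spectrogram. It is not the authors' implementation. The function names, the `(time, frequency)` array layout, the normalization of the gap to `[0, 1]`, and the use of adjacent, non-overlapping context slices are all assumptions made for illustration.

```python
import numpy as np


def sample_temporal_gap_pair(spec, slice_len, rng):
    """Draw two random slices from one clip and a temporal-gap label.

    spec: array of shape (num_frames, num_bins), e.g. a log-mel spectrogram.
    Returns (slice_a, slice_b, gap), where gap is the absolute distance
    between the slice start frames, normalized to [0, 1] (an assumption).
    """
    max_start = spec.shape[0] - slice_len
    if max_start < 1:
        raise ValueError("clip is too short for the requested slice length")
    # Upper bound is exclusive, so valid starts are 0..max_start inclusive.
    start_a, start_b = (int(s) for s in rng.integers(0, max_start + 1, size=2))
    gap = abs(start_a - start_b) / max_start
    slice_a = spec[start_a:start_a + slice_len]
    slice_b = spec[start_b:start_b + slice_len]
    return slice_a, slice_b, gap


def cbow_context_and_target(spec, slice_len, center_start, num_context):
    """Split a clip into a target slice and its past/future context slices.

    Mirrors the CBoW-style variant described in the abstract: the target is
    the slice starting at center_start, and the context consists of
    num_context slices before it and num_context slices after it
    (adjacent and non-overlapping here, which is an assumption).
    Returns (context, target) with context of shape
    (2 * num_context, slice_len, num_bins).
    """
    first = center_start - num_context * slice_len
    last_end = center_start + (num_context + 1) * slice_len
    if first < 0 or last_end > spec.shape[0]:
        raise ValueError("context window does not fit inside the clip")
    target = spec[center_start:center_start + slice_len]
    offsets = [k for k in range(-num_context, num_context + 1) if k != 0]
    context = np.stack([
        spec[center_start + k * slice_len:center_start + (k + 1) * slice_len]
        for k in offsets
    ])
    return context, target
```

In this sketch, the temporal gap model would regress `gap` from the pair of slices, while the CBoW-style Audio2Vec model would reconstruct `target` from `context`. The skip-gram-style variant mentioned in the abstract swaps the roles, predicting `context` from `target`.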