Keywords: Audio representations, self-supervised learning
Abstract: We explore self-supervision as a way to learn general-purpose audio
representations. Specifically, we propose self-supervised tasks that exploit the
temporal context in the spectrogram domain. The temporal gap task estimates
the distance between two short audio segments extracted at random from the same
audio clip. The Audio2Vec task is inspired by Word2Vec, a popular
technique for learning word embeddings, and aims at reconstructing a spectrogram
slice from past and future slices or, alternatively, at reconstructing the
context of surrounding slices from the current slice. We evaluate the quality of
the embeddings produced by the self-supervised models by measuring the
accuracy of linear classifiers that receive the embeddings as input and
address a variety of downstream audio tasks. Our results show that the
learned representations partially bridge the performance gap with fully
supervised models of similar size, and for some tasks even approach their
performance.
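As an illustrative sketch of the temporal gap task described above, the following hypothetical example samples two short segments from the same spectrogram and computes a normalized gap as the prediction target. The function name, segment length, and the choice to normalize the gap by clip length are assumptions for illustration, not the paper's exact recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

def temporal_gap_pair(spectrogram, segment_len):
    """Sample two random segments from one clip and return them together
    with the (normalized) temporal gap between their start frames.
    Illustrative sketch only; not the paper's exact setup."""
    n_frames = spectrogram.shape[0]
    starts = rng.integers(0, n_frames - segment_len, size=2)
    seg_a = spectrogram[starts[0]:starts[0] + segment_len]
    seg_b = spectrogram[starts[1]:starts[1] + segment_len]
    # Regression target: absolute gap, normalized by clip length (assumption).
    gap = abs(int(starts[0]) - int(starts[1])) / n_frames
    return seg_a, seg_b, gap

# Toy stand-in for a log-mel spectrogram: 1000 frames x 64 mel bins.
spec = rng.standard_normal((1000, 64))
seg_a, seg_b, gap = temporal_gap_pair(spec, segment_len=96)
```

A model for this task would embed each segment separately and predict `gap` from the pair of embeddings, so that solving the task requires the embeddings to encode temporal context.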