I’m Sorry for Your Loss: Spectrally-Based Audio Distances Are Bad at Pitch

Published: 09 Dec 2020, Last Modified: 05 May 2023
ICBINB 2020 Poster
Readers: Everyone
Keywords: pitch, auditory, generative models, dissimilarity ranking, neural networks, loss, loss function, spectral centroid, gradient, gradient-based learning, deep learning, audio similarity, perceptual distance
TL;DR: Commonly used audio-to-audio loss functions are pitch-blind, thus confounding their use as optimization criteria.
Abstract: Growing research demonstrates that synthetic failure modes imply poor generalization. We compare commonly used audio-to-audio losses on a synthetic benchmark, measuring the pitch distance between two stationary sinusoids. The results are surprising: many have a poor sense of pitch direction. These shortcomings are exposed using simple rank assumptions. Our task is trivial for humans but difficult for these audio distances, suggesting significant progress can be made in self-supervised audio learning by improving current losses.
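To make the kind of test described in the abstract concrete, here is a minimal sketch in Python with NumPy. It is not the authors' benchmark: the L1 distance between magnitude spectra, the sample rate, the frame length, and the chosen frequencies are all assumptions picked for illustration. It compares a 440 Hz reference sinusoid against candidates that are progressively further away in pitch and collects the resulting distances.

```python
import numpy as np

# All constants below are illustrative assumptions, not values from the paper.
SAMPLE_RATE = 16000   # samples per second
FRAME_LENGTH = 1024   # samples per analysis frame


def sinusoid(freq_hz, n_samples=FRAME_LENGTH, sample_rate=SAMPLE_RATE):
    """Return one frame of a stationary, unit-amplitude sine wave."""
    t = np.arange(n_samples) / sample_rate
    return np.sin(2.0 * np.pi * freq_hz * t)


def magnitude_spectrum(signal):
    """Magnitude of the Hann-windowed real FFT of a single frame."""
    window = np.hanning(len(signal))
    return np.abs(np.fft.rfft(signal * window))


def spectral_l1(x, y):
    """L1 distance between magnitude spectra (a hypothetical stand-in
    for a spectrally-based audio loss)."""
    return float(np.sum(np.abs(magnitude_spectrum(x) - magnitude_spectrum(y))))


reference_hz = 440.0
# One near neighbour, then one, two, and three octaves above the reference.
candidates_hz = [450.0, 880.0, 1760.0, 3520.0]

reference = sinusoid(reference_hz)
distances = [spectral_l1(reference, sinusoid(f)) for f in candidates_hz]

for freq, dist in zip(candidates_hz, distances):
    print(f"{reference_hz:.0f} Hz vs {freq:.0f} Hz: distance = {dist:.1f}")
```

Under these assumptions, the distance grows only while the two spectral peaks overlap. Once the peaks are disjoint, the candidates one, two, and three octaves away receive nearly the same distance, so a rank-based check of the form "a candidate further away in pitch should score as more distant" cannot be satisfied, and the distance offers little guidance about which way to move the pitch.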