Abstract: End-to-end Audio-to-Score (A2S) transcription aims to derive a score that represents the music content of an audio recording in a single step. While current state-of-the-art methods, which rely on Convolutional Recurrent Neural Networks trained with the Connectionist Temporal Classification loss function, have shown promising results under constrained circumstances, these approaches still exhibit fundamental limitations, especially when dealing with complex sequence modeling tasks such as polyphonic music. To overcome these limitations, this work introduces an alternative learning scheme based on a Transformer decoder, specifically tailored for A2S by incorporating a two-dimensional positional encoding that preserves frequency-time relationships when processing the audio signal. The results obtained over three datasets of polyphonic string music confirm the adequacy of the method, which improves the transcription rate by an average of 44% compared to previous approaches.
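The two-dimensional positional encoding mentioned in the abstract can be illustrated with a minimal sketch. The snippet below is an assumption, not the authors' exact formulation: it assumes a standard sinusoidal scheme in which half of the embedding dimensions encode the frequency axis and the other half the time axis of a spectrogram-like feature map; the function name `positional_encoding_2d` and the chosen sizes (64 frequency bins, 256 frames, 128-dimensional embeddings) are purely illustrative.

```python
import numpy as np

def sinusoidal_pe(length, dim):
    """Standard 1-D sinusoidal positional encoding (Vaswani et al., 2017)."""
    position = np.arange(length)[:, None]                              # (length, 1)
    div_term = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)   # (dim/2,)
    pe = np.zeros((length, dim))
    pe[:, 0::2] = np.sin(position * div_term)
    pe[:, 1::2] = np.cos(position * div_term)
    return pe

def positional_encoding_2d(n_freq, n_time, dim):
    """2-D positional encoding over a frequency-time grid.

    Half of the embedding dimensions encode the frequency (vertical) axis and
    the other half encode the time (horizontal) axis, so every cell of the
    feature map receives a position-dependent vector.
    """
    assert dim % 2 == 0, "embedding dimension must be even"
    pe_freq = sinusoidal_pe(n_freq, dim // 2)   # (n_freq, dim/2)
    pe_time = sinusoidal_pe(n_time, dim // 2)   # (n_time, dim/2)
    pe = np.zeros((n_freq, n_time, dim))
    pe[:, :, : dim // 2] = pe_freq[:, None, :]  # broadcast along the time axis
    pe[:, :, dim // 2 :] = pe_time[None, :, :]  # broadcast along the frequency axis
    return pe

# Example (hypothetical sizes): encode a 64-bin x 256-frame feature map, then
# flatten it into the (sequence_length, dim) layout a Transformer expects.
pe = positional_encoding_2d(n_freq=64, n_time=256, dim=128)
tokens_pe = pe.reshape(-1, 128)   # (64 * 256, 128)
print(tokens_pe.shape)
```

In such a scheme, the encoding would typically be added to the convolutional feature map before flattening it into the token sequence consumed by the Transformer, which is one way to retain the frequency-time structure the abstract refers to.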