Cycle Frequency-Harmonic-Time Transformer for Note-Level Singing Voice Transcription

Yulun Wu, Yaolong Ju, Simon Lui, Jing Yang, Fan Fan, Xuhao Du

Published: 2024, Last Modified: 23 Mar 2026, Venue: ICME 2024, License: CC BY-SA 4.0
Abstract: Singing voice transcription (SVT) is the task of converting singing voice audio into a sequence of symbolic notes. Although most SVT models use the time-frequency information in the input spectrogram, the harmonic information in singing voices remains underexploited. In this paper, we propose a novel 3D Cycle Frequency-Harmonic-Time Transformer (CFT) that explicitly captures the harmonic series of singing voices. We first define a tokenization scheme that captures harmonics across multiple octaves; the harmonic features are then aggregated into frequency-harmonic-time representations via a cyclic architecture. Results show that our method achieves state-of-the-art performance on several public datasets, including note-wise accuracy increases of 5.76% on MIR-ST500 and 13.56% on Cmedia.
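As background for the harmonic series mentioned in the abstract, the following minimal sketch shows a general acoustic fact: on a log-frequency axis, the k-th harmonic of a note sits at a fixed offset of bins_per_octave * log2(k) bins above the fundamental, whatever the pitch. This is not the paper's tokenization scheme, which the abstract does not specify; the function name and the choice of 12 bins per octave are assumptions made purely for illustration.

```python
import math


def harmonic_bin_offsets(num_harmonics, bins_per_octave):
    """Return the offset, in log-frequency bins, of harmonics 1..num_harmonics.

    Hypothetical helper for illustration only. Harmonic k has frequency
    k * f0, so on a log2 frequency axis it lies log2(k) octaves above f0.
    Offsets are rounded to the nearest whole bin.
    """
    return [
        round(bins_per_octave * math.log2(k))
        for k in range(1, num_harmonics + 1)
    ]


# Assumed resolution: 12 bins per octave (one bin per semitone).
offsets = harmonic_bin_offsets(4, 12)
print(offsets)  # [0, 12, 19, 24]
```

Because these offsets do not depend on the fundamental, a model can gather the same relative bins for every candidate pitch, which is one plausible reason harmonic structure is convenient to index explicitly.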