Abstract: Singing voice transcription (SVT) is the task of converting a singing-voice recording into a sequence of symbolic notes. Although most SVT models use the time-frequency information in the input spectrogram, the harmonic information in singing voices remains underexploited. In this paper, we propose a novel 3D Cycle Frequency-Harmonic-Time Transformer (CFT) that explicitly captures the harmonic series of singing voices. We first define a tokenization scheme that captures harmonics across multiple octaves; the harmonic features are then aggregated into frequency-harmonic-time representations via a cyclic architecture. Results show that our method achieves state-of-the-art performance on several public datasets, including note-wise accuracy increases of 5.76% on MIR-ST500 and 13.56% on Cmedia.
External IDs: dblp:conf/icmcs/WuJLYFD24