Abstract: We introduce MIR-MLPop, a publicly available multilingual pop music dataset designed for automatic lyrics transcription and lyrics alignment in polyphonic music. The dataset comprises 90 pop music tracks in Mandarin, Cantonese, and Taiwanese Hokkien, each with manually annotated, time-aligned lyrics that include both character and pronunciation labels. To the best of our knowledge, this is the first singing dataset for Cantonese and Taiwanese Hokkien. In our experiments, we use the pretrained Whisper model as the backbone to develop lyrics transcription and lyrics alignment models for all three languages. Overall, the results are promising for both tasks but show clear differences among the languages: our models perform significantly better on languages seen by Whisper during pretraining than on the language it has not seen. This finding highlights the potential challenge of lyrics transcription and alignment for low-resource languages that are not covered by pretrained speech models.