Keywords: Cross-modal music translation, Multitask learning, Optical music recognition, Automatic music transcription, Image-to-audio, MIDI-to-audio, Music information retrieval, YouTube Score Video dataset
TL;DR: A single Transformer trained on multiple cross-modal MIR tasks, using a unified tokenization scheme and YTSV, a new dataset of 1,300+ hours of paired score images and audio, surpasses specialized baselines (OMR SER 24.58% → 13.67%, state of the art) and enables the first score-image-to-audio generation.
Abstract: Traditional Music Information Retrieval (MIR) tasks such as Optical Music Recognition (OMR) and Automatic Music Transcription (AMT) are typically addressed with specialized, single-task models. We challenge this paradigm by proposing a unified framework that trains a single Transformer on multiple cross-modal translation tasks simultaneously. Our approach is enabled by two key contributions: a novel large-scale dataset (YTSV) with over 1,300 hours of paired score-image and audio data, and a unified tokenization scheme that converts all music modalities into a common sequence format. Experiments show that our multitask model significantly outperforms specialized baselines, reducing the OMR symbol error rate from 24.58% to a state-of-the-art 13.67%. Most notably, our framework achieves the first successful end-to-end generation of audio directly from a score image, marking a significant breakthrough in cross-modal music understanding and generation.
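The unified tokenization scheme is what lets one Transformer consume and emit every modality as an ordinary token sequence. Below is a minimal Python sketch of the general idea only; the vocabulary layout, event set, and all names here (`TOKEN_ID`, `tokenize_midi`, the 10 ms time grid, the 1024-entry audio codebook) are illustrative assumptions, not the paper's actual scheme.

```python
# Hypothetical sketch of a unified token vocabulary shared across modalities.
# Score images, MIDI events, and audio codec frames are all mapped into one
# integer id space so a single Transformer can translate between them.

SPECIAL = ["<pad>", "<bos>", "<eos>", "<score_image>", "<midi>", "<audio>"]
NOTE_ON = [f"note_on_{p}" for p in range(128)]              # MIDI pitches 0-127
NOTE_OFF = [f"note_off_{p}" for p in range(128)]
TIME_SHIFT = [f"time_{t}ms" for t in range(10, 1010, 10)]   # assumed 10 ms grid
AUDIO_CODES = [f"audio_{c}" for c in range(1024)]           # assumed VQ codebook ids

VOCAB = SPECIAL + NOTE_ON + NOTE_OFF + TIME_SHIFT + AUDIO_CODES
TOKEN_ID = {tok: i for i, tok in enumerate(VOCAB)}

def tokenize_midi(events):
    """Map (kind, value) MIDI-like events into the shared vocabulary."""
    ids = [TOKEN_ID["<midi>"]]
    for kind, value in events:
        ids.append(TOKEN_ID[f"{kind}_{value}"])
    ids.append(TOKEN_ID["<eos>"])
    return ids

# Example: middle C (pitch 60) held for 500 ms.
print(tokenize_midi([("note_on", 60), ("time", "500ms"), ("note_off", 60)]))
```

Under this framing, each task (OMR, AMT, image-to-audio) is simply a different source/target pairing over the same vocabulary, which is what makes joint multitask training straightforward.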
Submission Number: 4