Keywords: Cross-modal music translation, Multitask learning, Optical music recognition, Automatic music transcription, Image-to-audio, MIDI-to-audio, Music information retrieval, YouTube Score Video dataset
TL;DR: A single Transformer trained on multiple cross-modal MIR tasks, using a unified tokenization scheme and YTSV, a new dataset of 1,300+ hours of paired score images and audio, surpasses specialized baselines (OMR SER 24.58% → 13.67%, state of the art) and enables the first score-image-to-audio generation.
Abstract: Traditional Music Information Retrieval (MIR) tasks such as Optical Music Recognition (OMR) and Automatic Music Transcription (AMT) are typically addressed with specialized, single-task models. We challenge this paradigm by proposing a unified framework that trains a single Transformer on multiple cross-modal translation tasks simultaneously. Our approach is enabled by two key contributions: a novel large-scale dataset (YTSV) with over 1,300 hours of paired score-image and audio data, and a unified tokenization scheme that converts all music modalities into a common sequence format. Experiments show that our multitask model significantly outperforms specialized baselines, reducing the OMR symbol error rate from 24.58% to a state-of-the-art 13.67%. Most notably, our framework achieves the first successful end-to-end generation of audio directly from a score image, marking a significant breakthrough in cross-modal music understanding and generation.
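The unified tokenization scheme is what lets one Transformer consume and emit every modality as an ordinary token sequence. Below is a minimal Python sketch of the general idea only; the vocabulary layout, event set, and all names here (`TOKEN_ID`, `tokenize_midi`, the 10 ms time grid, the 1024-entry audio codebook) are illustrative assumptions, not the paper's actual scheme.

```python
# Hypothetical sketch of a unified token vocabulary shared across modalities.
# Score images, MIDI events, and audio codec frames are all mapped into one
# integer id space so a single Transformer can translate between them.

SPECIAL = ["<pad>", "<bos>", "<eos>", "<score_image>", "<midi>", "<audio>"]
NOTE_ON = [f"note_on_{p}" for p in range(128)]              # MIDI pitches 0-127
NOTE_OFF = [f"note_off_{p}" for p in range(128)]
TIME_SHIFT = [f"time_{t}ms" for t in range(10, 1010, 10)]   # assumed 10 ms grid
AUDIO_CODES = [f"audio_{c}" for c in range(1024)]           # assumed VQ codebook ids

VOCAB = SPECIAL + NOTE_ON + NOTE_OFF + TIME_SHIFT + AUDIO_CODES
TOKEN_ID = {tok: i for i, tok in enumerate(VOCAB)}

def tokenize_midi(events):
    """Map (kind, value) MIDI-like events into the shared vocabulary."""
    ids = [TOKEN_ID["<midi>"]]
    for kind, value in events:
        ids.append(TOKEN_ID[f"{kind}_{value}"])
    ids.append(TOKEN_ID["<eos>"])
    return ids

# Example: middle C (pitch 60) held for 500 ms.
print(tokenize_midi([("note_on", 60), ("time", "500ms"), ("note_off", 60)]))
```

Under this framing, each task (OMR, AMT, image-to-audio) is simply a different source/target pairing over the same vocabulary, which is what makes joint multitask training straightforward.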
Submission Number: 4