MUSCAT: A Multimodal mUSic Collection for Automatic Transcription of Real Recordings and Image Scores
Abstract: Multimodal audio-image music transcription has recently been posed as a means of retrieving a digital score representation by combining the individual estimations of Automatic Music Transcription (AMT) systems, which operate on acoustic recordings, and Optical Music Recognition (OMR) systems, which operate on image scores. Nevertheless, while shown to outperform single-modality recognition rates, this approach has been validated exclusively under controlled scenarios involving monotimbral and monophonic synthetic data, mainly owing to the lack of collections with symbolic score-level annotations for both recordings and graphical sheets. To promote research on this topic, this work presents the Multimodal mUSic Collection for Automatic Transcription (MUSCAT), an assortment of acoustic recordings, image sheets, and their score-level annotations in several notation formats. The dataset comprises almost 80 hours of real recordings with varied instrumentation and polyphony degrees, ranging from piano to orchestral music, as well as 1,251 scanned sheets and 880 symbolic scores from 37 composers; its metadata also makes it suitable for other tasks such as instrument identification or composer recognition. A fragmented subset of this collection focused solely on acoustic data for score-level AMT, the MUSic Collection for aUtomatic Transcription - fragmented Subset (MUSCUTS), is also presented together with a baseline experimentation, whose results highlight the need to foster research on score-level AMT with real recordings. Finally, a web-based service is provided to collaboratively enlarge the collections.