MUSCAT: a Multimodal mUSic Collection for Automatic Transcription of real recordings and image scores

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM2024 Oral · CC BY 4.0
Abstract: Multimodal audio-image music transcription has recently been posed as a means of retrieving a digital score representation by leveraging the individual estimations of Automatic Music Transcription (AMT)---for acoustic recordings---and Optical Music Recognition (OMR)---for image scores---systems. Nevertheless, while shown to outperform single-modality recognition rates, this approach has been validated exclusively under controlled scenarios---monotimbral and monophonic synthetic data---mainly due to the lack of collections with symbolic score-level annotations for both recordings and graphical sheets. To promote research on this topic, this work presents the $\textit{Multimodal mUSic Collection for Automatic Transcription}$ (MUSCAT), an assortment of acoustic recordings, image sheets, and their score-level annotations in several notation formats. The dataset comprises almost 80 hours of real recordings with varied instrumentation and polyphony degrees---from piano to orchestral music---1,251 scanned sheets, and 880 symbolic scores from 37 composers; its metadata also enables other tasks such as instrument identification or composer recognition. A fragmented subset of this collection focused exclusively on acoustic data for score-level AMT---the $\textit{MUSic Collection for aUtomatic Transcription - fragmented Subset}$ (MUSCUTS)---is also presented together with a baseline experimentation, which highlights the need to foster research in this field with real recordings. Finally, a web-based service is provided to grow the collections collaboratively.
Primary Subject Area: [Experience] Art and Culture
Secondary Subject Area: [Experience] Multimedia Applications, [Content] Media Interpretation, [Systems] Data Systems Management and Indexing
Relevance To Conference: Dear MM Organization staff, We are submitting our paper titled "MUSCAT: a Multimodal mUSic Collection for Automatic Transcription of real recordings and image scores" for consideration at the upcoming ACMMM 2024. We believe that our work significantly contributes to the advancement of multimedia and multimodal processing, aligning well with the goals of the SIGMM community. Our research presents a collection of real audio recordings and image sheets along with notation-level digital scores. It aims to address the lack of existing multimodal music corpora, particularly those built from real data. Our work intends to advance the state of the art in multimodal image and audio music transcription and, in a complementary way, in each single modality as well. To the best of our knowledge, the underdevelopment of this multimodal field in the literature stems mainly from the scarcity and limited size of available datasets, as well as the lack of consensus among them on formats and file types. We recognize the importance of highlighting the main focus of the study: our paper emphasizes the multimedia transcription aspects, aligning closely with the conference's multimodal/multimedia objectives. Sincerely, Alejandro Galán-Cuenca, University of Alicante.
Submission Number: 4796