- Abstract: Translating an audio sequence into a symbolic representation of music is a fundamental problem in Music Information Retrieval (MIR), referred to as Automatic Music Transcription (AMT). Recently, convolutional neural networks (CNNs) have been successfully applied to the task by translating individual frames of audio (Sigtia et al., 2016; Thickstun et al., 2017). However, these models cannot, by their nature, capture temporal relations and long-term dependencies. Furthermore, obtaining annotations for supervised learning in this setting is extremely labor-intensive. We propose a model that overcomes these problems. The convolutional sequence-to-sequence (Cseq2seq) model applies a CNN to learn a low-dimensional representation of audio frames and a sequential model to translate these learned features directly into a symbolic representation. Our approach has three advantages over other methods: (i) the audio frame representation and the sequential model are trained jointly end-to-end, (ii) the recurrent model can capture temporal features in musical pieces to improve transcription, and (iii) the model learns from entire sequences rather than from temporally precise onset and offset annotations for each note, which makes it possible to train on large, already existing corpora of music. To test our method, we created our own dataset of 17K monophonic songs and their respective MusicXML files. Initial experiments demonstrate the validity of our approach.
- TL;DR: Solving automatic music transcription without note-level annotated data, deep learning style.
- Keywords: automatic music transcription, audio, music, deep learning, sparse data regime
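
To make advantage (iii) concrete, the following minimal sketch contrasts the two kinds of training target for a short monophonic melody: frame-level labels, which need exact timing for every note, and a sequence-level token target, which needs only the order of notes and their notated durations. The token vocabulary (`<sos>`, `pitch=...`, `dur=...`, `<eos>`), the function names, and the example melody are assumptions made for illustration; they are not the encoding used by the authors, whose targets are derived from MusicXML files.

```python
from typing import List, Tuple

# Hypothetical note format: (pitch name or "rest", duration in quarter notes).
Note = Tuple[str, float]


def to_token_sequence(notes: List[Note]) -> List[str]:
    """Sequence-level target: pitch and duration tokens, no timestamps."""
    tokens = ["<sos>"]
    for pitch, quarters in notes:
        tokens.append(f"pitch={pitch}")
        tokens.append(f"dur={quarters:g}")
    tokens.append("<eos>")
    return tokens


def to_frame_labels(
    notes: List[Note], tempo_bpm: float, frames_per_second: int
) -> List[str]:
    """Frame-level target: one label per audio frame, so the exact
    timing of every note must be known in advance."""
    seconds_per_quarter = 60.0 / tempo_bpm
    labels: List[str] = []
    elapsed = 0.0
    for pitch, quarters in notes:
        elapsed += quarters * seconds_per_quarter
        # Fill frames up to the end of this note with its label.
        target_len = round(elapsed * frames_per_second)
        labels.extend([pitch] * (target_len - len(labels)))
    return labels


# Illustrative melody (assumed, not taken from the dataset).
melody: List[Note] = [("C4", 1.0), ("D4", 0.5), ("rest", 0.5), ("E4", 2.0)]

if __name__ == "__main__":
    print(to_token_sequence(melody))
    print(len(to_frame_labels(melody, tempo_bpm=120, frames_per_second=100)))
```

The sequence-level target depends only on the score, so it can be read from an existing symbolic file without aligning it to the recording, whereas the frame-level target changes with tempo and frame rate and therefore requires a time-aligned annotation of each performance.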