Keywords: music, representation learning, evaluation, music understanding, music information retrieval
TL;DR: We present a semantically rich language–audio dataset that captures the idiosyncratic, colloquial, and diverse language found in organic online musical discourse, and we benchmark models on generative and retrieval music understanding tasks.
Abstract: We present MusicSem, a dataset of 32,493 language–audio music descriptions derived from organic discussions on Reddit. What sets MusicSem apart is its focus on capturing a broad spectrum of musical semantics, reflecting how listeners naturally describe music in nuanced, human-centered ways. To structure these expressions, we propose a taxonomy of five semantic categories: descriptive, atmospheric, situational, metadata-related, and contextual. Our motivation for releasing MusicSem stems from the observation that music representation learning models often lack sensitivity to these semantic dimensions, due to the limited expressiveness of existing training datasets. MusicSem addresses this gap by serving as a novel semantics-aware resource for training and evaluating models on tasks such as cross-modal music generation and retrieval.
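To make the five-category taxonomy concrete, here is a minimal sketch of what a single MusicSem record might look like. The field names and example values are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical sketch of a MusicSem record; field names are
# illustrative assumptions, not the dataset's actual schema.
from dataclasses import dataclass

@dataclass
class MusicSemEntry:
    audio_path: str        # path to the paired audio clip
    source_post: str       # organic Reddit discussion the description came from
    # One free-text description per semantic category in the taxonomy:
    descriptive: str = ""  # musical attributes (e.g., "driving bassline")
    atmospheric: str = ""  # mood or feel (e.g., "hazy, late-night")
    situational: str = ""  # listening context (e.g., "road-trip music")
    metadata: str = ""     # artist, genre, era, and similar facts
    contextual: str = ""   # cultural or personal background

# Illustrative usage:
entry = MusicSemEntry(
    audio_path="clips/000001.wav",
    source_post="This track has that warm analog synth sound...",
    descriptive="warm analog synths over a slow breakbeat",
    atmospheric="nostalgic and hazy",
)
print(entry.atmospheric)
```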
Track: Paper Track
Confirmation: Paper Track: I confirm that I have followed the formatting guideline and anonymized my submission.
Submission Number: 62