Keywords: Music Language Model, Multimodal Language Model, Music Understanding, Music Generation
Abstract: Music is a unique and essential modality in human life, and it presents challenges for multimodal advances due to its complex structure and intricate details. Recent Music Language Models (MuLMs) facilitate music understanding and generation by leveraging the inherent knowledge and reasoning capabilities of pre-trained Language Models (LMs), yet they overlook the complementary benefits of different music representations. To this end, we propose a unified music language model, named UniMuLM, which extends the existing single-representation approach to multiple music representations. To achieve this unification, we address the challenges of missing modalities and unstable training, enabling the model to adapt to different scenarios. Specifically, we integrate symbolic music, waveform music, and textual instructions into an LM and design a bar-level tokenizer to explore the fine-grained correlations between different modalities. Moreover, we propose a multi-stage training strategy to progressively enhance this synergy. Trained on open-source datasets, UniMuLM demonstrates superior performance compared to SOTA methods across five music tasks evaluated on nine benchmark datasets.
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10159