Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation

Published: 08 Sept 2025, Last Modified: 10 Sept 2025
Venue: LLM4Music @ ISMIR 2025 (Oral)
License: CC BY 4.0
Keywords: Music Generation, Retrieval-Augmented Generation, Controllable Generation
TL;DR: MTM uses text and music as explicit bridges, combined with dual-track retrieval and a ControlFormer, to generate high-quality, controllable music from text, images, and video; it is trained on the 24K-sample MTV-24K dataset.
Abstract: Multimodal music generation aims to produce music from diverse input modalities such as text, videos, and images. Existing methods typically rely on a shared embedding space, but this approach faces challenges such as insufficient cross-modal data, weak cross-modal alignment, and limited controllability. This paper addresses these issues by introducing explicit bridges of text and music for improved alignment. We propose a novel pipeline for constructing multimodal music datasets, yielding two new datasets, MTV-24K and MT-512K, both annotated with rich musical attributes. Additionally, we propose MTM, a Multimodal-to-Text-to-Music framework built on these bridges. Experiments on video-to-music, image-to-music, text-to-music, and controllable music generation tasks demonstrate that MTM significantly improves music quality, modality alignment, and user controllability compared to existing methods.
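The abstract describes the flow only at a high level, so the following is a minimal conceptual sketch of how a Multimodal-to-Text-to-Music pipeline with two explicit bridges could be wired together. All names and the toy implementations (text_bridge, music_bridge, generate_music) are assumptions standing in for real captioning, retrieval, and generation models; they are not the paper's released code.

```python
# Conceptual sketch: every input is first mapped to a text description (text
# bridge), the description drives retrieval of reference music (music bridge),
# and both condition the final music generator. Placeholder logic only.

from dataclasses import dataclass, field
from typing import List


@dataclass
class MusicClip:
    """A retrieved reference clip with its musical-attribute annotations."""
    clip_id: str
    attributes: dict = field(default_factory=dict)


def text_bridge(modality: str, content: str) -> str:
    """Map any input modality to a textual description (identity for text;
    a captioning model would handle images and video)."""
    if modality == "text":
        return content
    return f"[{modality} caption] {content}"  # placeholder for a captioning model


def music_bridge(description: str, k: int = 2) -> List[MusicClip]:
    """Retrieve reference music matching the description (stands in for the
    dual-track retrieval over the attribute-annotated datasets)."""
    return [MusicClip(clip_id=f"ref_{i}", attributes={"query": description})
            for i in range(k)]


def generate_music(description: str, references: List[MusicClip]) -> str:
    """Condition a text-to-music generator on both bridges (the paper fuses
    such conditioning signals with a ControlFormer)."""
    return f"music conditioned on '{description}' and {len(references)} reference clips"


def multimodal_to_music(modality: str, content: str) -> str:
    description = text_bridge(modality, content)   # explicit text bridge
    references = music_bridge(description)         # explicit music bridge
    return generate_music(description, references)


if __name__ == "__main__":
    print(multimodal_to_music("video", "a drone shot of waves at sunset"))
```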
Submission Number: 3