CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages

ACL ARR 2025 February Submission3265 Authors

15 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0
Abstract: CLaMP 3 is a unified framework developed to address challenges of cross-modal and cross-lingual generalization in music information retrieval. Using contrastive learning, it aligns all major music modalities--including sheet music, performance signals, and audio recordings--with multilingual text in a shared representation space, enabling retrieval across unaligned modalities with text as a bridge. It features a multilingual text encoder adaptable to unseen languages, exhibiting strong cross-lingual generalization. Leveraging retrieval-augmented generation, we curated M4-RAG, a web-scale dataset consisting of 2.31 million music-text pairs. This dataset is enriched with detailed metadata that represents a wide array of global musical traditions. To advance future research, we release WikiMT-X, a benchmark comprising 1,000 triplets of sheet music, audio, and richly varied text descriptions. Experiments show that CLaMP 3 achieves state-of-the-art performance on multiple MIR tasks, significantly surpassing previous strong baselines and demonstrating excellent generalization in multimodal and multilingual music contexts.
Paper Type: Long
Research Area: Information Retrieval and Text Mining
Research Area Keywords: Music Information Retrieval (MIR), Multimodal Learning, Cross-Lingual Retrieval, Contrastive Learning, Music Representation Learning
Contribution Types: Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: English, Russian, French, Spanish, Arabic, Chinese, Finnish, Greek, Tamil, Kazakh, Amharic, and 89 other languages (100 in total)
Submission Number: 3265
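
The abstract describes aligning each music modality with text through contrastive learning, so that modalities never paired with each other can still be retrieved across, with text as the bridge. The sketch below illustrates that idea on toy data. It is not the authors' implementation: the abstract does not state the exact objective, so the symmetric InfoNCE-style loss, the temperature of 0.07, the 2-D embeddings, and all function names (`normalize`, `cosine`, `info_nce`, `rank`) are assumptions made for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def info_nce(music_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    Pair i (music_embs[i], text_embs[i]) is the positive; every other
    pairing in the batch acts as a negative. The temperature is an
    assumed value, not one taken from the paper.
    """
    n = len(music_embs)
    logits = [[cosine(m, t) / temperature for t in text_embs] for m in music_embs]

    def cross_entropy(rows):
        # Mean of -log softmax(row)[i], computed with the log-sum-exp trick.
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)
            log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_z - row[i]
        return total / n

    columns = [list(col) for col in zip(*logits)]  # text-to-music direction
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))


def rank(query, candidates):
    """Return candidate indices sorted by descending cosine similarity."""
    return sorted(range(len(candidates)), key=lambda i: -cosine(query, candidates[i]))


# Toy shared space (hypothetical values): item 0 and item 1 are two pieces.
text = [[1.0, 0.0], [0.0, 1.0]]
sheet = [[0.9, 0.1], [0.1, 0.9]]  # trained against text only
audio = [[0.8, 0.2], [0.2, 0.8]]  # trained against text only

# Correctly paired batches give a lower loss than mismatched ones.
aligned_loss = info_nce(sheet, text)
shuffled_loss = info_nce(sheet, text[::-1])

# Sheet music and audio were never paired, yet both sit near the same
# text embedding, so a sheet-music query ranks the matching audio first.
sheet_to_audio = rank(sheet[0], audio)
```

In this toy setup `aligned_loss` is smaller than `shuffled_loss`, and `sheet_to_audio` places audio item 0 first. Real encoders would produce high-dimensional embeddings, but the retrieval step stays the same: embed the query, then rank candidates from any modality by cosine similarity in the shared space.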