Keywords: AI-generated music, music evaluation, CLAP, M2D2, contrastive learning, MusicEval, AudioMOS, MIR
Presentation Preference: Yes
Abstract: Evaluating the perceptual quality of AI-generated music remains a challenge in music information retrieval and computational creativity applications. Approaches such as those adopted in the MusicEval and AudioMOS challenges primarily rely on CLAP, a contrastive audio-text model, to extract embeddings for Mean Opinion Score (MOS) prediction. While CLAP excels at coarse audio-text alignment, it struggles to capture fine-grained musical attributes such as timbral richness, rhythmic precision, and structural coherence, leading to suboptimal alignment with expert human evaluations. We introduce ConvM2D2, a novel dual-branch neural architecture that leverages M2D2, a second-generation masked modeling framework, as the upstream audio encoder for MOS prediction. M2D2 is trained to reconstruct masked audio segments, enabling it to capture temporally and acoustically detailed features that more closely reflect human perceptual criteria. The ConvM2D2 model processes audio and text embeddings jointly through specialized convolutional and multi-layer perceptron pathways to predict both Overall Musical Quality and Textual Alignment scores. We evaluate ConvM2D2 on the MusicEval benchmark, comparing its performance against competing models, and achieve improvements across all evaluation metrics (MSE, LCC, SRCC, and KTAU) at both the utterance and system level. ConvM2D2 reaches a system-level LCC of 0.964 and reduces MSE by 88% compared to the baseline, demonstrating strong alignment with human judgments on both the overall musical quality and textual alignment tasks. This substantial improvement indicates that ConvM2D2 assesses AI-generated music in closer agreement with expert listeners, making it easier to identify, refine, and recommend higher-quality generated music.
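The abstract describes a dual-branch design in which M2D2 audio embeddings pass through a convolutional pathway, text embeddings pass through an MLP pathway, and the two are fused to regress the Overall Musical Quality and Textual Alignment scores. The following is a minimal PyTorch sketch of how such a prediction head could be wired up; it assumes pre-extracted M2D2 frame embeddings and prompt-level text embeddings, and all layer sizes, dimensions, and names (e.g. `ConvM2D2Head`, `d_audio`, `d_text`) are illustrative assumptions rather than the authors' actual configuration.

```python
import torch
import torch.nn as nn

class ConvM2D2Head(nn.Module):
    """Hypothetical dual-branch MOS prediction head (illustrative sketch only).

    Assumes pre-extracted M2D2 audio embeddings of shape (batch, time, d_audio)
    and text embeddings of shape (batch, d_text); dimensions are placeholders.
    """

    def __init__(self, d_audio=768, d_text=512, hidden=256):
        super().__init__()
        # Convolutional pathway over the temporal axis of the audio embeddings.
        self.audio_branch = nn.Sequential(
            nn.Conv1d(d_audio, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time to a fixed-size vector
        )
        # Multi-layer perceptron pathway for the text embedding.
        self.text_branch = nn.Sequential(
            nn.Linear(d_text, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
        )
        # Joint regression heads for the two MOS targets.
        self.quality_head = nn.Linear(2 * hidden, 1)    # Overall Musical Quality
        self.alignment_head = nn.Linear(2 * hidden, 1)  # Textual Alignment

    def forward(self, audio_emb, text_emb):
        # audio_emb: (batch, time, d_audio) -> (batch, d_audio, time) for Conv1d
        a = self.audio_branch(audio_emb.transpose(1, 2)).squeeze(-1)
        t = self.text_branch(text_emb)
        joint = torch.cat([a, t], dim=-1)
        return self.quality_head(joint), self.alignment_head(joint)


if __name__ == "__main__":
    head = ConvM2D2Head()
    audio = torch.randn(4, 200, 768)  # placeholder M2D2 frame embeddings
    text = torch.randn(4, 512)        # placeholder text-prompt embeddings
    quality, alignment = head(audio, text)
    print(quality.shape, alignment.shape)  # torch.Size([4, 1]) torch.Size([4, 1])
```

The sketch only illustrates the joint audio-text fusion and the two regression outputs described in the abstract; the paper's actual layer configuration, pooling strategy, and training objective are not specified here.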
Submission Number: 11