MEMA: Multimodal Aesthetic Evaluation of Music in Visual Contexts

Huaye Zhang, Chenglizhao Chen, Mengke Song, Tingting Chen, Diqiong Jiang, Lichun Liu, Xinyu Liu

Published: 22 Feb 2026, Last Modified: 22 Mar 2026, OpenReview Archive Direct Upload, Everyone, CC BY 4.0

Abstract: Music retrieval, soundtrack generation, and video understanding have advanced rapidly, making the aesthetic evaluation of video soundtracks an important research topic. Because soundtracks shape a video's emotional atmosphere and drive its narrative rhythm, systematic methods are needed to assess how well they coordinate artistically with the visual content. Existing approaches, however, mostly evaluate the quality of the music itself and often cannot model the deeper aesthetic synergy between audio and visuals. To address this gap, we propose MEMA, a new soundtrack aesthetic evaluation model trained in two stages. The first stage builds a cross-modal imagination mechanism using a Conditional Variational Autoencoder, which achieves bidirectional semantic reconstruction between audio and visuals. The second stage introduces a Guided Cross-Attention Alignment Module, which enhances the model's focus on key narrative moments in the video. To facilitate this research, we also construct VMAE-Sets, the first large-scale dataset dedicated to soundtrack aesthetic evaluation. MEMA produces scores and textual evaluations along three core aesthetic dimensions. Experimental results show that MEMA outperforms existing methods, with average improvements of 18.137% in LCC and 17.866% in SRCC over the strongest baseline. These findings indicate superior audio–visual narrative alignment and high consistency with human judgments.
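For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how LCC and SRCC are commonly computed between predicted scores and human ratings. It assumes LCC denotes the Pearson linear correlation coefficient and SRCC the Spearman rank-order correlation coefficient, as is standard in quality and aesthetic assessment. The function names and sample values are hypothetical and do not come from MEMA or VMAE-Sets.

```python
import numpy as np


def rank_with_ties(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        # Extend j over the run of values equal to sorted_vals[i].
        while j + 1 < len(values) and sorted_vals[j + 1] == sorted_vals[i]:
            j += 1
        # Positions i..j (0-based) share the average 1-based rank.
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def lcc(pred, target):
    """Pearson linear correlation between predictions and human scores."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.corrcoef(pred, target)[0, 1])


def srcc(pred, target):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return lcc(rank_with_ties(pred), rank_with_ties(target))


if __name__ == "__main__":
    # Hypothetical model scores and human ratings, for illustration only.
    model_scores = [1.0, 2.0, 3.0, 4.0]
    human_scores = [1.0, 4.0, 9.0, 16.0]
    print("LCC :", lcc(model_scores, human_scores))
    print("SRCC:", srcc(model_scores, human_scores))
```

In this toy case the ordering of the two score lists agrees perfectly, so SRCC reaches its maximum, while LCC is slightly lower because the relationship is monotonic but not linear. Reporting both metrics therefore separates ranking agreement from linear agreement with human judgments.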