Keywords: Dynamic Music Emotion Recognition, Psychoacoustics, Cochleogram, Transformer
TL;DR: We propose PD-Former, a lightweight dual-stream architecture that achieves state-of-the-art dynamic music emotion recognition by combining psychoacoustically motivated Cochleogram features, which simulate human auditory perception, with a Transformer that captures long-range temporal dependencies.
Abstract: Dynamic Music Emotion Recognition (DMER) aims to track continuous emotional variation in music, yet machine predictions still lag behind human perception. This gap stems from a fundamental scientific issue: conventional acoustic features such as Mel spectrograms obscure the spectral details that carry psychoacoustic cues like sensory dissonance and temporal fine structure. In addition, prevalent RNN-based models struggle to capture the long-range dependencies of musical narratives. To bridge this gap, we propose the Psychoacoustic-Informed Dual-Stream Transformer (PD-Former). The method introduces Cochleogram features that simulate basilar-membrane responses, capturing physiological texture cues that complement the acoustic structure information provided by Mel spectrograms. A dual-stream convolutional architecture processes these heterogeneous features independently before synergistic fusion, and a Transformer then models long-range temporal dependencies. Experiments on the DEAM dataset show that PD-Former achieves state-of-the-art performance while remaining lightweight: RMSE drops by 12.5% in Valence and 15.8% in Arousal relative to the acoustic-only baseline, and by 5.6% and 2.1%, respectively, relative to state-of-the-art benchmarks. Ablation studies further validate the complementarity of psychoacoustic and acoustic features, the necessity of dual-stream fusion, and the superiority of the Transformer in capturing long-range dependencies.
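The abstract does not include an implementation, so the following is a minimal PyTorch sketch of the dual-stream design it describes: two CNN encoders over heterogeneous time-frequency inputs (a Mel spectrogram and a Cochleogram, the latter typically computed with a gammatone filterbank), feature-level fusion, and a Transformer encoder for long-range temporal modeling with per-frame Valence/Arousal regression. All class names (`StreamEncoder`, `DualStreamTransformer`), layer sizes, bin counts, and the frame rate are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class StreamEncoder(nn.Module):
    """CNN encoder for one time-frequency stream; pools frequency, keeps time resolution."""
    def __init__(self, n_bins: int, d_model: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d((2, 1)),            # halve frequency bins, preserve frames
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.proj = nn.Linear(64 * (n_bins // 4), d_model)

    def forward(self, x):                     # x: (batch, 1, n_bins, n_frames)
        h = self.conv(x)                      # (batch, 64, n_bins // 4, n_frames)
        h = h.permute(0, 3, 1, 2).flatten(2)  # (batch, n_frames, 64 * n_bins // 4)
        return self.proj(h)                   # (batch, n_frames, d_model)

class DualStreamTransformer(nn.Module):
    """Encodes each stream independently, fuses them, then models time with a Transformer."""
    def __init__(self, mel_bins=128, coch_bins=64, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.mel_stream = StreamEncoder(mel_bins, d_model)
        self.coch_stream = StreamEncoder(coch_bins, d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)   # simple concat-and-project fusion
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 2)             # per-frame (valence, arousal)

    def forward(self, mel, coch):
        z = torch.cat([self.mel_stream(mel), self.coch_stream(coch)], dim=-1)
        z = self.temporal(self.fuse(z))               # self-attention over all frames
        return self.head(z)                           # (batch, n_frames, 2)

# Usage: a 30 s clip annotated at 2 Hz gives 60 frames (an assumed rate).
mel = torch.randn(8, 1, 128, 60)
coch = torch.randn(8, 1, 64, 60)
model = DualStreamTransformer()
print(model(mel, coch).shape)                         # torch.Size([8, 60, 2])
```

The key property this sketch illustrates is that, unlike an RNN, the Transformer's self-attention connects every frame to every other frame in one step, which is what the abstract appeals to for modeling long-range musical narratives.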
Submission Number: 65