Keywords: Foundation models, Audio-language models, Large language models, Timbre, Music cognition, Music emotion recognition, Zero-shot learning, Prompt engineering
TL;DR: We evaluate foundation models on timbre-related cognitive tasks and show that a hybrid pipeline combining CLAP descriptors with Centaur best reproduces human responses in music emotion recognition.
Abstract: Foundation models are increasingly applied to MIR tasks, yet their performance on music cognition problems remains underexplored. In this work, we investigate how state-of-the-art audio-language models and large language models (LLMs) perform on timbre-related cognitive tasks. We focus on music emotion recognition, i.e., predicting listeners' perceived and induced emotions in response to instrument tones, and run additional tests on instrument recognition. We evaluate contrastive audio-language models (CLAP variants and MuQ-MuLan) in both zero-shot and probe-based settings, and compare their performance with Centaur, a recent LLM fine-tuned on human decision patterns. We further propose a novel inference pipeline that integrates CLAP descriptors as intermediate textual prompts for LLMs. Results show that LLMs, especially Centaur, outperform both zero-shot and probe-trained contrastive models, while the hybrid pipeline yields the best performance overall. Our findings suggest that combining audio-language and language-only models provides a promising direction for modelling music-related cognition, with implications for applications such as music recommendation, generation, and adaptive audio interfaces.
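To make the hybrid pipeline concrete, here is a minimal sketch of the idea: score a vocabulary of timbre descriptors against the audio with CLAP, then render the best matches as a textual prompt for a downstream LLM. The checkpoint (laion/clap-htsat-unfused), the descriptor list, and the prompt template are illustrative assumptions, not the paper's exact setup, and the Centaur call itself is omitted.

```python
# Sketch of the CLAP-descriptors-to-LLM-prompt pipeline.
# Assumptions (not from the paper): descriptor vocabulary, prompt
# template, and the public CLAP checkpoint used below.
import torch
import librosa
from transformers import ClapModel, ClapProcessor

DESCRIPTORS = ["bright", "dark", "warm", "harsh",
               "mellow", "metallic", "breathy", "percussive"]

processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")
model = ClapModel.from_pretrained("laion/clap-htsat-unfused")

def timbre_prompt(wav_path: str, top_k: int = 3) -> str:
    """Rank timbre descriptors against the audio with CLAP, then
    format the top-k matches as a textual prompt for an LLM."""
    audio, _ = librosa.load(wav_path, sr=48_000)  # CLAP expects 48 kHz
    inputs = processor(
        text=[f"a {d} instrument tone" for d in DESCRIPTORS],
        audios=[audio], sampling_rate=48_000,
        return_tensors="pt", padding=True,
    )
    with torch.no_grad():
        # logits_per_audio: audio-to-text similarity, shape (1, n_descriptors)
        sims = model(**inputs).logits_per_audio.squeeze(0)
    top = [DESCRIPTORS[i] for i in sims.topk(top_k).indices]
    return (f"The tone sounds {', '.join(top)}. "
            "Rate the emotion it conveys to a listener.")

# The resulting prompt would then be passed to the language model
# (e.g., Centaur) in place of raw audio.
```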
Submission Number: 24