Keywords: Multimodal Embeddings, Language-Audio Embeddings, Timbre, Perception
TL;DR: This work evaluates the ability of joint language-audio embeddings to capture perceptual timbre semantics.
Abstract: Understanding and modeling the relationship between language and sound is critical for applications such as music information retrieval, text-guided music generation, and audio captioning. Central to these tasks are joint language–audio embedding spaces, which map textual descriptions and auditory content into a shared embedding space. While multimodal models such as MS-CLAP, LAION-CLAP, and MuQ-MuLan have shown strong performance in aligning language and audio, their correspondence to human perception of timbre, a multifaceted attribute encompassing qualities such as brightness, roughness, and warmth, remains underexplored. In this paper, we evaluate these three joint language–audio embedding models on their ability to capture perceptual dimensions of timbre. Our findings show that LAION-CLAP consistently provides the most reliable alignment with human-perceived timbre semantics across both instrumental sounds and audio effects.
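To illustrate the kind of text–audio alignment evaluated here, the following is a minimal sketch of scoring sounds against timbre descriptors with the open-source `laion_clap` package and its `CLAP_Module` API; the descriptor prompts and audio file paths are hypothetical placeholders, not the paper's actual stimuli or protocol.

```
import numpy as np
import laion_clap

# Load a pretrained LAION-CLAP model (downloads the default checkpoint).
model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()

# Hypothetical timbre descriptors and example sounds.
texts = ["a bright sound", "a warm sound", "a rough sound"]
audio_files = ["violin_note.wav", "cello_note.wav"]

# Embed both modalities into the shared space.
text_emb = model.get_text_embedding(texts, use_tensor=False)          # (3, D)
audio_emb = model.get_audio_embedding_from_filelist(
    x=audio_files, use_tensor=False)                                   # (2, D)

# Cosine similarity: rows are sounds, columns are descriptors.
text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
audio_emb = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
similarity = audio_emb @ text_emb.T
print(similarity)
```

Such similarity scores can then be compared against human ratings of the same sounds along perceptual timbre dimensions.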
Track: Paper Track
Confirmation: Paper Track: I confirm that I have followed the formatting guideline and anonymized my submission.
Submission Number: 60