Keywords: CLAP, Timbre, Embedding Space, Interpretability, Text-to-Music Retrieval, Text-to-Music Generation
TL;DR: This paper investigates CLAP's internal representations of Timbre by identifying directional vectors in its embedding space and using them as perturbations in TTM retrieval and generation experiments.
Abstract: The transition from feature-based language-audio representations to higher-dimensional ones from pre-trained foundation models has enabled us to map audio content to a significantly broader vocabulary of natural language. However, some interpretability of the alignment between the embedding spaces of the two modalities, and of their relation to psychoacoustic features, is lost as a byproduct. In this study, we investigate timbre representations in CLAP in both the text and audio embedding spaces. We identify directions for different timbral qualities in each embedding space and use them as perturbation vectors in uni-modal and cross-modal Text-to-Music (TTM) Retrieval and Generation downstream tasks. We find that although both audio and text embeddings move monotonically along their respective timbre directions, timbral variation is more linearly distributed—and therefore more easily exploitable—in the audio embedding space. Cross-modal perturbation experiments further reveal that the audio and text embedding spaces form a geometrically aligned subspace with respect to timbre. Additionally, our analysis identifies cases where CLAP’s timbre representations closely align with perceptually grounded spectral features, and cases where such alignment is limited.
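The abstract does not specify how the timbre directions are computed, but a common recipe for such directional vectors is the difference of class means between two contrasting groups of embeddings, followed by perturbation along that direction. The sketch below illustrates this generic approach on synthetic stand-ins for CLAP embeddings; the group labels ("dark"/"bright"), dimensions, and the mean-difference method itself are assumptions for illustration, not the paper's confirmed procedure.

```python
import numpy as np

def timbre_direction(emb_a, emb_b):
    # Unit vector pointing from the mean of group A toward the mean of group B
    # (a generic "difference of class means" direction, assumed for illustration).
    d = emb_b.mean(axis=0) - emb_a.mean(axis=0)
    return d / np.linalg.norm(d)

def perturb(emb, direction, alpha):
    # Shift an embedding along the direction, then re-project onto the unit
    # sphere, since contrastive embeddings like CLAP's are typically L2-normalized.
    out = emb + alpha * direction
    return out / np.linalg.norm(out)

# Synthetic stand-ins for CLAP embeddings of, e.g., "dark" vs "bright" items.
rng = np.random.default_rng(0)
dim = 512
dark = rng.normal(size=(32, dim))
bright = dark + 0.5 * rng.normal(size=dim)  # shared offset plays the "timbre" cue

d = timbre_direction(dark, bright)
e = dark[0] / np.linalg.norm(dark[0])
sims = [float(perturb(e, d, a) @ d) for a in (0.0, 0.5, 1.0)]
print(sims)  # cosine similarity to the timbre direction grows with alpha
```

Renormalizing after the shift keeps the perturbed vector on the unit sphere, so larger `alpha` monotonically rotates the embedding toward the timbre direction rather than merely scaling it, mirroring the monotonic movement the abstract reports.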
Submission Number: 24