Abstract: Speech emotion retrieval is an important technique for large-scale, high-quality data collection. Conventional approaches using an ensemble of classification models may limit the diversity of retrieved emotions and/or underperform in out-of-domain acoustic conditions. Natural language is diverse and agnostic to specific acoustic concepts, offering great potential for developing language-based speech emotion retrieval systems. In this paper, we introduce CLAP4Emo, a novel framework for retrieving emotional speech via natural language prompts based on contrastive language-audio pretraining. To compensate for the absence of training captions in existing public datasets, we propose a systematic framework that applies ChatGPT to generate emotion captions. The experimental results demonstrate that our method effectively improves the diversity of retrieved samples while maintaining high precision across five benchmark datasets. By leveraging large language models, we establish a connection between audio and language for emotion description, culminating in an intuitive and interactive retrieval system. We release the generated emotion captions at: https://github.com/boschresearch/soundsee-emo-caps