Keywords: Fine-grained Speech Style, Multimodal Representation, Benchmark, Human Preference Alignment
Abstract: Contrastive Language–Audio Pretraining (CLAP) has shown strong performance in modeling general audio–text correspondences, but remains limited in capturing complex and diverse speech styles. We propose Speech-CLAP, a contrastive model that learns joint representations of speech audio and style descriptions, capturing both intrinsic speaker characteristics (e.g., age, gender, timbre) and dynamic expressive features (e.g., emotion, speaking rate, intonation). The model is trained on a 10,000-hour speech–style corpus with detailed textual descriptions of speech styles. We further introduce the Speech-Style Similarity Benchmark ($S^3$Bench), the first cross-lingual benchmark for speech-style similarity, comprising Chinese and English speech-style pairs with human preference annotations. Experimental results show that Speech-CLAP's similarity judgments align closely with human preferences. This work not only provides a solid foundation for style-aware speech representation but also establishes an evaluation standard for future research on speech-style modeling. We will release both the Speech-CLAP model and $S^3$Bench to the community to facilitate future work in this area.
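The abstract describes Speech-CLAP as a contrastive model aligning speech audio with style descriptions, but does not specify its encoders or training objective. The sketch below assumes a generic CLAP/CLIP-style symmetric InfoNCE loss over a batch of (speech, style-description) embedding pairs; the encoder placeholders and the temperature value are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a CLAP-style symmetric contrastive (InfoNCE) objective between
# speech embeddings and style-description embeddings. Assumed setup: both encoders
# project into a shared embedding space; positives are the matched pairs on the
# diagonal of the similarity matrix.
import torch
import torch.nn.functional as F


def clap_style_contrastive_loss(speech_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over cosine-similarity logits for a batch of
    (speech, style-description) pairs."""
    speech_emb = F.normalize(speech_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = speech_emb @ text_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_s2t = F.cross_entropy(logits, targets)               # speech -> text direction
    loss_t2s = F.cross_entropy(logits.t(), targets)           # text -> speech direction
    return 0.5 * (loss_s2t + loss_t2s)


if __name__ == "__main__":
    # Toy usage: random tensors stand in for speech-encoder and style-text-encoder outputs.
    batch, dim = 8, 512
    speech_emb = torch.randn(batch, dim)
    text_emb = torch.randn(batch, dim)
    print(clap_style_contrastive_loss(speech_emb, text_emb).item())
```

At inference time, speech-style similarity between an utterance and a textual style description would then be scored as the cosine similarity of their normalized embeddings, which is the quantity one would compare against the human preference annotations in $S^3$Bench.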
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 19491