Keywords: Fine-grained Speech Style, Multimodal Representation, Benchmark, Human Preference Alignment
Abstract: Contrastive Language–Audio Pretraining (CLAP) has shown strong performance in modeling general audio–text correspondences, but remains limited in capturing complex and diverse speech styles. We propose Speech-CLAP, a contrastive model that learns joint representations of speech audio and style descriptions, capturing both intrinsic speaker characteristics (e.g., age, gender, timbre) and dynamic expressive features (e.g., emotion, speaking rate, intonation). The model is trained on a 10,000-hour speech–style corpus with detailed textual descriptions of speech styles. We further introduce the Speech-Style Similarity Benchmark ($S^3$Bench), the first cross-lingual benchmark for speech-style similarity, comprising Chinese and English speech-style pairs with human preference annotations. Experimental results show that Speech-CLAP's similarity judgments align closely with human preferences. This work not only provides a solid foundation for style-aware speech representation but also establishes an evaluation standard for future research on speech-style modeling. We will release both the Speech-CLAP model and $S^3$Bench to the community to facilitate future work in this area.
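The abstract describes Speech-CLAP as a contrastive model aligning speech audio with style descriptions, but does not specify its encoders or training objective. The sketch below assumes a generic CLAP/CLIP-style symmetric InfoNCE loss over a batch of (speech, style-description) embedding pairs; the encoder placeholders and the temperature value are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a CLAP-style symmetric contrastive (InfoNCE) objective between
# speech embeddings and style-description embeddings. Assumed setup: both encoders
# project into a shared embedding space; positives are the matched pairs on the
# diagonal of the similarity matrix.
import torch
import torch.nn.functional as F


def clap_style_contrastive_loss(speech_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over cosine-similarity logits for a batch of
    (speech, style-description) pairs."""
    speech_emb = F.normalize(speech_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = speech_emb @ text_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_s2t = F.cross_entropy(logits, targets)               # speech -> text direction
    loss_t2s = F.cross_entropy(logits.t(), targets)           # text -> speech direction
    return 0.5 * (loss_s2t + loss_t2s)


if __name__ == "__main__":
    # Toy usage: random tensors stand in for speech-encoder and style-text-encoder outputs.
    batch, dim = 8, 512
    speech_emb = torch.randn(batch, dim)
    text_emb = torch.randn(batch, dim)
    print(clap_style_contrastive_loss(speech_emb, text_emb).item())
```

At inference time, speech-style similarity between an utterance and a textual style description would then be scored as the cosine similarity of their normalized embeddings, which is the quantity one would compare against the human preference annotations in $S^3$Bench.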
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 19491