EmoNet-Voice: A Large-Scale Synthetic Benchmark for Fine-Grained Speech Emotion

ICLR 2026 Conference Submission 17877 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: speech emotion recognition, synthetic speech dataset, fine-grained emotions, emotion intensity annotation, privacy-preserving data, multilingual audio, benchmark evaluation, expert validation, cross-dataset generalization, pre-training dataset

TL;DR: A privacy-preserving 5,000-hour multilingual speech emotion dataset spanning 40 categories with expert consensus labels, enabling ethical study of sensitive emotions and achieving competitive real-world performance.

Abstract: Speech emotion recognition (SER) systems are constrained by existing datasets, which typically cover only 6-10 basic emotions, lack scale and diversity, and raise ethical challenges when collecting sensitive emotional states. We introduce EmoNet-Voice, a resource that addresses these limitations through two components: (1) EmoNet-Voice Big, a 5,000-hour multilingual pre-training dataset spanning 40 fine-grained emotion categories across 11 voices and 4 languages, and (2) EmoNet-Voice Bench, a rigorously validated benchmark of 4.7k samples with unanimous expert consensus on emotion presence and intensity levels. Because the data is generated with state-of-the-art synthetic voice generation, our privacy-preserving approach enables ethical inclusion of sensitive emotions (e.g., pain, shame) while maintaining controlled experimental conditions. Each sample was validated by three psychology experts. We demonstrate that our Empathic Insight models, trained on our synthetic data, generalize well to real-world datasets, as tested on EmoDB and RAVDESS. Furthermore, our evaluation reveals that while high-arousal emotions are readily detected (e.g., anger: 95% accuracy), the benchmark exposes the difficulty of distinguishing perceptually similar emotions (e.g., sadness vs. distress: 63% discrimination), providing quantifiable metrics for advancing nuanced emotion AI. EmoNet-Voice establishes a new paradigm for large-scale, ethically sourced, fine-grained SER research.

Supplementary Material: zip

Primary Area: datasets and benchmarks

Submission Number: 17877
