EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models

ICLR 2026 Conference Submission 15900 Authors

19 Sept 2025 (modified: 08 Oct 2025) | ICLR 2026 Conference Submission | Readers: Everyone | License: CC BY 4.0

Keywords: Speech Language Models, Empathetic Dialogue, Multi‑Stage Evaluation, Benchmark, Voice Cues

TL;DR: EchoMind is an interrelated multi‑level benchmark evaluating empathetic dialogue in speech language models by unifying linguistic and paralinguistic understanding in a context‑linked framework.

Abstract: Speech Language Models (SLMs) have advanced spoken language understanding. However, it remains unclear whether they can truly hear you, recognizing not only spoken words but also non‑lexical vocal cues, and respond with empathy, aligning replies both emotionally and contextually. Existing benchmarks typically evaluate linguistic, acoustic, reasoning, or dialogue abilities in isolation, overlooking the integration of these skills that is crucial for human‑like, emotionally intelligent conversation. We present EchoMind, the first interrelated, multi‑level benchmark that simulates the cognitive process of empathetic dialogue through sequential, context‑linked tasks: spoken‑content understanding, vocal‑cue perception, integrated reasoning, and response generation. All tasks share identical, semantically neutral scripts, free of explicit emotional or contextual cues, while controlled vocal‑style variations test the effect of delivery independent of the transcript. EchoMind is grounded in an empathy‑oriented framework spanning 3 coarse and 12 fine‑grained dimensions, encompassing 39 vocal attributes, and is evaluated using both objective and subjective metrics. Testing 12 advanced SLMs reveals that even state‑of‑the‑art models struggle with highly expressive vocal cues, which limits empathetic response quality. Analyses of prompt strength, speech source, and ideal vocal cue recognition reveal persistent weaknesses in instruction‑following, resilience to natural speech variability, and effective use of vocal cues for empathy. These results underscore the need for SLMs that integrate linguistic content with diverse vocal cues to achieve truly empathetic conversational ability.

Supplementary Material: zip

Primary Area: datasets and benchmarks

Submission Number: 15900
