Keywords: Spoken Pragmatics; Large Speech-Language Models; Human-Computer Interaction
Abstract: End-to-end Large Speech-Language Models (LSLMs) are revolutionizing classic, hierarchy-based spoken dialogue systems owing to their potential for latency reduction, seamless modality integration, and unified optimization across speech understanding, language reasoning, and speech generation.
However, these models lack a systematic evaluation of "Robust Spoken Pragmatics": the ability to infer a speaker's $\textit{true}$ communicative intent by jointly interpreting the literal words, their acoustic realization, and the dynamic context of the interaction.
To address this critical evaluation gap, we introduce, to the best of our knowledge, the first benchmark for Robust Spoken Pragmatics in LSLMs.
The benchmark comprises five primary evaluation scenarios (i.e., Contextual Auditory Attention, Dynamic Addressee Tracking, Atypical Reference Resolution, Prosodic Disambiguation, and Nonliteral Intent Recognition), encompassing 11 subtasks in total.
Through extensive experiments on nine mainstream LSLMs, we uncover a significant performance bottleneck: although these models demonstrate strong downstream reasoning and generation capabilities, their limitations in handling fundamental spoken-pragmatic challenges critically constrain their overall interactional effectiveness. They often misinterpret the speaker's core communicative intent, producing responses grounded in incorrect assumptions. We will release our benchmark to facilitate further research into building more robust and interactionally intelligent LSLMs.
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: evaluation and metrics; human-in-the-loop
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English, Chinese
Submission Number: 7791