Keywords: Spoken Pragmatics; Large Speech-Language Models; Human-Computer Interaction
Abstract: End-to-end Large Speech-Language Models (LSLMs) are revolutionizing classic, hierarchy-based spoken dialogue systems owing to their potential for latency reduction, seamless modality integration, and unified optimization across speech understanding, language reasoning, and speech generation.
However, these models lack a systematic evaluation of "Robust Spoken Pragmatics": the ability to infer a speaker's $\textit{true}$ communicative intent by jointly interpreting the literal words, their acoustic realization, and the dynamic context of the interaction.
To address this critical evaluation gap, we introduce, to the best of our knowledge, the first benchmark for Robust Spoken Pragmatics in LSLMs.
The benchmark comprises five primary evaluation scenarios (i.e., Contextual Auditory Attention, Dynamic Addressee Tracking, Atypical Reference Resolution, Prosodic Disambiguation, and Nonliteral Intent Recognition), encompassing 11 subtasks in total.
Through extensive experiments on nine mainstream LSLMs, we uncover a significant performance bottleneck: although these models demonstrate strong downstream reasoning and generation capabilities, their limitations in handling fundamental spoken-pragmatic challenges critically constrain their overall interactional effectiveness. They often misinterpret the speaker's core communicative intent, producing responses grounded in incorrect assumptions. We will release our benchmark to facilitate further research into building more robust and interactionally intelligent LSLMs.
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: evaluation and metrics; human-in-the-loop
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English, Chinese
Submission Number: 7791