MTalk-Bench: Multi-Turn Dialogue Benchmark for Speech-to-Speech Large Language Models

ACL ARR 2025 May Submission 7403 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: The rapid advancement of speech-to-speech (S2S) large language models (LLMs) has brought impressive progress in real-time spoken interaction. However, current evaluation methods fall short in assessing multi-turn dialogue capabilities, especially in realistic and complex communication settings. To fill this gap, we introduce MTalk-Bench, the first multi-turn S2S benchmark, designed to evaluate S2S LLMs across 9 high-frequency multi-turn dialogue scenarios. MTalk-Bench adopts a three-tier evaluation framework covering Semantic Information, Paralinguistic Information, and Ambient Sound, reflecting the rich dynamics of human conversation. We conduct both human and LLM-based evaluations and further analyze the reliability of LLMs as judges. Experimental results show that GPT-4o-realtime consistently achieves state-of-the-art performance across all tiers and exhibits strong reliability as an evaluator. While several S2S LLMs perform well on semantic comprehension, they still struggle with conversations involving paralinguistic and environmental audio cues. MTalk-Bench offers a standardized, multidimensional evaluation tool to drive research toward more context-aware, robust S2S dialogue systems.
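As a rough illustration of the Arena-style pairwise evaluation the abstract describes (not the authors' released code), the sketch below tallies per-tier win rates from an LLM judge's pairwise preferences. All identifiers (Dialogue, judge_prefers_a, the tier names) are illustrative assumptions made for this sketch.

```python
# Minimal sketch, assuming an Arena-style pairwise LLM-as-judge setup over
# the three MTalk-Bench tiers. Names and structure are hypothetical.
from dataclasses import dataclass
from random import random

TIERS = ("semantic", "paralinguistic", "ambient")

@dataclass
class Dialogue:
    """One multi-turn test item: scenario, tier, and two models' responses."""
    scenario: str
    tier: str        # one of TIERS
    response_a: str  # transcribed output of model A
    response_b: str  # transcribed output of model B

def judge_prefers_a(item: Dialogue) -> bool:
    """Placeholder for the judge call; a coin flip stands in for a
    prompted LLM comparing the two responses."""
    return random() < 0.5

def arena_win_rates(items: list[Dialogue]) -> dict[str, float]:
    """Per-tier win rate of model A under pairwise judging."""
    wins = {t: 0 for t in TIERS}
    totals = {t: 0 for t in TIERS}
    for item in items:
        totals[item.tier] += 1
        if judge_prefers_a(item):
            wins[item.tier] += 1
    return {t: wins[t] / totals[t] for t in TIERS if totals[t]}
```

In the paper's actual setup, the judge would wrap a prompted model (e.g., GPT-4o-realtime) and human annotators would provide the reference preferences against which judge reliability is analyzed; the coin flip above only keeps the sketch self-contained and runnable.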
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Arena; Multi-turn; Speech-to-Speech; LLM
Languages Studied: English
Submission Number: 7403