Can AI Understand Mandarin Speech Prosody? A Framework and Benchmark Showcase

Published: 2025, Last Modified: 21 Jan 2026. Venue: INTERSPEECH 2025. License: CC BY-SA 4.0
Abstract: Modeling and estimating speech prosody is considered a challenging task in understanding and generating natural speech. We introduce the Mandarin Speech Prosody Benchmark (MSPB), a linguistically grounded dataset for evaluating Speech Large Language Models (Speech LLMs) in Mandarin. MSPB comprises eight tasks covering crucial prosodic features and their interactions with syntax, semantics, and pragmatics. All MSPB items, designed according to Mandarin linguistic principles and validated by experts, were phonetically recorded and verified. We evaluated six Speech LLMs (GPT-4o, Gemini-1.5-Pro, Gemini-2-Flash, Qwen2-Audio-7B-Instruct, GLM-4-Voice, MiniCPM-o 2.6). Although some models perform well with context-rich cues (e.g., irony), they generally struggle with subtle prosodic variations (e.g., focus marking) and underperform humans. MSPB provides a valuable tool for assessing and enhancing prosodic comprehension, and the results underscore the need for improved prosodic integration in future research.