Voice Evaluation of Reasoning Ability: Diagnosing the Modality-Induced Performance Gap

Published: 02 Mar 2026, Last Modified: 08 Mar 2026, ICLR 2026 Workshop ICBINB, CC BY 4.0
Keywords: voice reasoning, large language models, multimodal evaluation, speech-language models, reasoning gap, voice assistants, benchmark
Abstract: Voice-interactive LLMs can transcribe speech with near-human accuracy and hold fluent conversations, yet we find they are strikingly unable to reason while talking. Evaluating 12 voice systems alongside text baselines on the Voice Evaluation of Reasoning Ability (VERA) benchmark, comprising 2,931 voice-native episodes across five reasoning tracks, we document a severe and consistent Voice Reasoning Gap (VRG): on competition mathematics a leading text model achieves 74.8% accuracy while its voice counterpart reaches only 6.1%; macro-averaged, the best text model scores 54.0% versus 11.3% for voice. What makes this finding surprising is that every reasonable mitigation fails: extended thinking time yields negligible or even negative gains, and a cascade architecture that decouples a powerful reasoning backend from a fast narration frontend still falls far short of text parity. Failure analysis reveals that different architectures do not merely underperform; they fail in distinct, predictable ways. Streaming models produce fluent but incorrect responses, while cascades introduce grounding errors. These architecture-specific error signatures point to a fundamental tension between real-time audio streaming and the iterative computation required for reasoning.
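For readers unfamiliar with the aggregation, the sketch below illustrates macro-averaging across the five reasoning tracks, i.e., accuracy is computed per track and then averaged so each track contributes equally regardless of episode count. The track names and per-track numbers are hypothetical placeholders, not results reported in the paper.

```python
# Minimal sketch of macro-averaged accuracy over reasoning tracks.
# Track names and accuracies are hypothetical placeholders for illustration.
from statistics import mean

per_track_accuracy = {
    "track_1": 0.061,  # hypothetical per-track accuracy
    "track_2": 0.150,
    "track_3": 0.120,
    "track_4": 0.140,
    "track_5": 0.090,
}

# Macro average: unweighted mean of per-track accuracies.
macro_avg = mean(per_track_accuracy.values())
print(f"macro-averaged accuracy: {macro_avg:.1%}")
```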
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 87