Can Audio LLMs Understand Spoken Language? An Inference Test Based on Alternative Semantics

ACL ARR 2026 January Submission 9119 Authors

06 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0

Keywords: audio LLM, semantics, focus

Abstract: We introduce a new inference task for audio LLMs in which the correct response depends crucially on the location of a focal accent. Models are tested under a variety of settings and beat a text-only baseline only when given helpful prompting, including few-shot examples. The proposed task shows for the first time how to test the ability of LLMs to incorporate audio information into semantic interpretation. The results show that the task is very challenging for the models tested, indicating that, for spoken language, LLMs lag far behind human abilities.

Paper Type: Short

Research Area: Linguistic theories, Cognitive Modeling and Psycholinguistics

Research Area Keywords: semantics, language models, prosody, inference

Contribution Types: Data resources, Theory

Languages Studied: English

Submission Number: 9119
