The Long Arc of Audio: A Comprehensive Survey of Long-Context Spoken Language Models

ACL ARR 2026 May Submission15774 Authors

26 May 2026 (modified: 14 Jun 2026), ACL ARR 2026 May Submission, CC BY 4.0

Keywords: spoken language model, long context, survey

Abstract: Spoken Language Models (SLMs) are rapidly emerging as universal speech processing systems, yet their ability to handle long-context audio remains limited. Audio is temporally dense and encodes rich semantic, paralinguistic, and acoustic information, making long-range modeling particularly challenging. This survey examines long-context spoken language modeling across two settings distinguished by where long context arises: within a single turn, covering long-form audio understanding, generation, and multi-audio reasoning; and across turns in multi-turn spoken dialogue. We review representative models, benchmarks, and potential technical solutions, and discuss open challenges and promising future directions.

Paper Type: Long

Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding

Research Area Keywords: speech technologies, spoken language understanding, spoken dialog

Contribution Types: Surveys

Languages Studied: English

EMNLP 2026 AI Reviewing Experiment: no

Submission Number: 15774
