Keywords: spoken language model, long context, survey
Abstract: Spoken Language Models (SLMs) are rapidly emerging as universal speech processing systems, yet their ability to handle long-context audio remains limited. Audio is temporally dense and encodes rich semantic, paralinguistic, and acoustic information, making long-range modeling particularly challenging. This survey examines long-context spoken language modeling across two settings distinguished by where long context arises: within a single turn, covering long-form audio understanding, generation, and multi-audio reasoning; and across turns in multi-turn spoken dialogue. We review representative models, benchmarks, and potential technical solutions, and discuss open challenges and promising future directions.
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: speech technologies, spoken language understanding, spoken dialog
Contribution Types: Surveys
Languages Studied: English
EMNLP 2026 AI Reviewing Experiment: no
Submission Number: 15774