SCIR: Learning Speech-based Conversational Interaction Representations from Continuous Acoustic Signals

ACL ARR 2026 January Submission 7511 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · License: CC BY 4.0
Keywords: Spoken Dialogue Systems, Turn-taking, Speech Representation Learning, Real-time Interaction, Full-duplex Dialogue, Acoustic Modeling, Backchannel Prediction, Barge-in Detection, Low-latency
Abstract: Real-time spoken dialogue systems demand precise, low-latency decisions on when to speak, listen, or yield—a challenge intensified in full-duplex settings characterized by speech overlap and competitive turn-taking. While emerging end-to-end Speech LLMs offer low latency, they often lack explicit controllability and robustness, whereas traditional cascade systems suffer from unavoidable processing delays due to ASR and generation. This work investigates the learning of conversational interaction representations directly from continuous acoustic signals to bridge this gap. We propose SCIR, a task-driven representation learned end-to-end, which unifies interaction timing decisions—including turn-taking, backchanneling, and barge-in—under a single streaming-compatible framework via explicit multi-task learning, without relying on textual inputs. Through extensive experiments, we demonstrate that lightweight SCIR models not only surpass large-scale, general-purpose speech baselines in predictive performance but do so with orders-of-magnitude lower latency and parameter efficiency. Crucially, we show that SCIR's anticipatory nature provides a "negative latency" buffer that effectively masks the computational overhead of cascade ASR and LLM pipelines. This establishes SCIR as a robust, plug-and-play, and intelligence-preserving solution for next-generation real-time dialogue agents.
Paper Type: Long
Research Area: Speech Processing and Spoken Language Understanding
Research Area Keywords: Speech Processing, Spoken Language Understanding, Dialogue Systems, Multimodality
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches for low-compute settings & efficiency
Languages Studied: English, Mandarin Chinese
Submission Number: 7511