Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage

ICLR 2026 Conference Submission14380 Authors

18 Sept 2025 (modified: 08 Oct 2025) · CC BY 4.0
Keywords: Speech-in Speech-out Dialogue Systems, Speech Language Models, Tool-Augmented Dialogue Systems, Retrieval-Augmented Generation
TL;DR: We present Streaming RAG, a framework enabling low-latency tool use in speech-in speech-out dialogue by issuing tool queries in parallel with user speech, doubling factual accuracy and reducing response latency by 20%.
Abstract: End-to-end speech-in speech-out dialogue systems are emerging as a powerful alternative to traditional ASR–LLM–TTS pipelines, generating more natural, expressive responses with significantly lower latency. However, these systems remain prone to hallucinations due to limited factual grounding. Text-based dialogue systems address this challenge by integrating tools such as web search and knowledge-graph APIs; we introduce the first approach that extends tool use directly to speech-in speech-out systems. A key challenge is that tool integration substantially increases response latency, disrupting conversational flow. To mitigate this, we propose Streaming Retrieval-Augmented Generation (Streaming RAG), a novel framework that reduces user-perceived latency by predicting tool queries in parallel with user speech, even before the user finishes speaking. Specifically, we develop a post-training pipeline that teaches the model when to issue tool calls during ongoing speech and how to generate spoken summaries that fuse audio queries with retrieved text results, thereby improving both accuracy and responsiveness. To evaluate our approach, we construct AudioCRAG, a benchmark created by converting queries from the publicly available CRAG dataset into speech form. Experimental results demonstrate that our Streaming RAG approach increases QA accuracy by over 200% relative and further enhances the user experience by reducing tool-use latency by 20%. Importantly, our Streaming RAG approach is modality-agnostic and can be applied equally to typed input, paving the way for more agentic, real-time AI assistants.
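The core idea in the abstract — issuing a tool query mid-utterance so retrieval overlaps with the remainder of the user's speech — can be illustrated with a minimal sketch. Everything below is an illustrative assumption, not the paper's actual pipeline: the trigger heuristic `predict_tool_query`, the stand-in `retrieve` function, and the fusion step are all hypothetical stand-ins for the learned components described in the paper.

```python
import threading
import queue

# Hypothetical sketch of the Streaming RAG idea: fire a tool query while
# the user is still speaking, so retrieval latency is hidden behind the
# rest of the utterance instead of added after end-of-turn.

def predict_tool_query(partial_transcript):
    """Toy trigger (assumption, not the paper's learned policy): fire a
    query once the partial transcript looks like a question with enough
    content words."""
    words = partial_transcript.lower().split()
    if len(words) >= 4 and words[0] in {"who", "what", "when", "where"}:
        return " ".join(words)  # use the partial text as the query
    return None

def retrieve(query, results):
    """Stand-in for a web-search or knowledge-graph tool call."""
    results.put(f"retrieved passage for: {query}")

def streaming_rag(asr_chunks):
    """Consume incremental ASR output; launch retrieval early, then fuse
    the retrieved evidence with the full utterance at end-of-turn."""
    results = queue.Queue()
    worker = None
    transcript = ""
    for chunk in asr_chunks:               # user is still speaking
        transcript = (transcript + " " + chunk).strip()
        if worker is None:
            q = predict_tool_query(transcript)
            if q is not None:              # tool call issued mid-speech
                worker = threading.Thread(target=retrieve, args=(q, results))
                worker.start()
    if worker is not None:
        worker.join()                      # retrieval overlapped with speech
        evidence = results.get()
        return f"answer('{transcript}', grounded_in='{evidence}')"
    return f"answer('{transcript}')"       # no tool call was triggered

print(streaming_rag(["who won the", "2022 world", "cup final"]))
```

In this toy run, the query fires after the second ASR chunk, so the (simulated) retrieval runs concurrently with the final chunk of speech; this is the mechanism the paper credits for the reported 20% latency reduction.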
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 14380