StreamFlow: Streaming Audio Generation from Discrete Tokens via Streaming Flow Matching

Ha-Yeong Choi; Sang-Hoon Lee

Published: 18 Sept 2025, Last Modified: 20 Dec 2025, NeurIPS 2025 poster, CC BY-NC 4.0

Keywords: Streaming Generation, Streaming Flow Matching, Neural Audio Codec, Speech Language Models, Generative Models

TL;DR: We introduce Streaming Flow Matching, a novel streaming generative model for real-time audio generation from discrete tokens.

Abstract: Diffusion models have demonstrated remarkable generative capabilities, and Conditional Flow Matching (CFM) has improved their inference efficiency by following optimal transport paths. However, CFM-based models still require multiple iterative sampling steps, which makes them unsuitable for real-time or streaming generation. In this paper, we introduce StreamFlow, a novel streaming generative model designed for real-time audio generation from discrete tokens. StreamFlow uses a causal noising training framework along the time axis and predicts multi-time vector fields at once on each stream, enabling streaming inference with minimal latency. To further improve generalization, we propose Scale-DiT, a Diffusion Transformer architecture that models, normalizes, and scales feature differences prior to skip connections. This significantly improves the robustness and performance of DiT without increasing the parameter count. We validate StreamFlow on audio reconstruction from the discrete tokens of EnCodec and Mimi, demonstrating both high-fidelity synthesis and streaming capability. Furthermore, we successfully integrate our model into Moshi, a full-duplex streaming speech language model, by replacing the Mimi decoder.
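
For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of standard (non-streaming) CFM with a straight-line path, not StreamFlow itself. It shows why sampling is iterative: each Euler step costs one vector-field evaluation. The function names and the closed-form `toy_field` are assumptions made for illustration; in a real system the vector field would be a trained network.

```python
import numpy as np

def cfm_training_pair(x1, rng):
    """Build one (t, x_t, target) training triple for a straight-line path."""
    x0 = rng.standard_normal(x1.shape)  # noise sample at flow time t = 0
    t = rng.uniform()                   # flow time drawn from [0, 1)
    x_t = (1.0 - t) * x0 + t * x1       # point on the line from x0 to x1
    target = x1 - x0                    # constant velocity along that line
    return t, x_t, target

def euler_sample(vector_field, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * vector_field(x, t)  # one model call per step
    return x

# Toy check (hypothetical): with a single data point x1, the exact field
# is (x1 - x) / (1 - t), so the sampler should land on x1.
rng = np.random.default_rng(0)
x1 = np.array([0.5, -1.0, 2.0])

def toy_field(x, t):
    return (x1 - x) / (1.0 - t)

sample = euler_sample(toy_field, rng.standard_normal(3), num_steps=8)
```

With a learned field, `num_steps` trades quality against latency, which is the cost the abstract says a streaming formulation needs to avoid.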

Supplementary Material: zip

Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)

Submission Number: 20862
