SoundReactor: Frame-level Online Video-to-Audio Generation

Koichi Saito; Julian Tanke; Christian Simon; Masato Ishii; Kazuki Shimada; Zachary Novack; Zhi Zhong; Akio Hayakawa; Takashi Shibuya; Yuki Mitsufuji

18 Sept 2025 (modified: 13 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0

Keywords: Frame-level online video-to-audio generation, video-to-audio generation, autoregressive models, diffusion models, multimodal generative modeling for sound and audio

TL;DR: We introduce the novel task of frame-level online video-to-audio generation and propose SoundReactor, which, to the best of our knowledge, is the first simple yet effective framework explicitly tailored for this task.

Abstract: Prevailing Video-to-Audio (V2A) generation models operate offline, assuming that an entire video sequence or chunks of frames are available beforehand. This critically limits their use in interactive applications such as live content creation and emerging generative world models. To address this gap, we introduce the novel task of frame-level online V2A generation, where a model autoregressively generates audio from video without access to future video frames. Furthermore, we propose SoundReactor, which, to the best of our knowledge, is the first simple yet effective framework explicitly tailored for this task. Our design enforces end-to-end causality and targets low per-frame latency while preserving audio-visual synchronization. The model's backbone is a decoder-only causal transformer over continuous audio latents. For vision conditioning, it leverages grid (patch) features extracted from the smallest variant of the DINOv2 vision encoder, which are aggregated into a single token per frame to maintain end-to-end causality and efficiency. The model is trained with diffusion pre-training followed by consistency fine-tuning, which accelerates decoding in the diffusion head. On a benchmark of diverse gameplay videos from AAA titles, our model generates semantically and temporally aligned, high-quality full-band stereo audio, as validated by both objective and human evaluations, at low per-frame, token-level latency ($26.6$ ms with head NFE=1 and $30.3$ ms with head NFE=4, on $30$ FPS, $480$p videos using a single H100). Demo samples are available at https://anonymous-sr-submission.github.io/.
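To put the reported latencies in context, the short sketch below compares them with the time available per frame at 30 FPS. It is an illustration only, not part of the submission: the helper names `frame_budget_ms` and `headroom_ms` are hypothetical, and it assumes that "online" operation requires each frame's audio token to be produced within one frame interval and that the reported figures cover the whole per-frame step.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per video frame, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def headroom_ms(latency_ms: float, fps: float) -> float:
    """Slack left in the frame interval after one generation step.

    A negative value would mean the step is slower than the frame rate
    (assumption: one generation step per video frame).
    """
    return frame_budget_ms(fps) - latency_ms


# Latencies quoted in the abstract (30 FPS, 480p video, single H100).
REPORTED_LATENCY_MS = {"NFE=1": 26.6, "NFE=4": 30.3}

if __name__ == "__main__":
    fps = 30.0
    print(f"frame budget at {fps:g} FPS: {frame_budget_ms(fps):.1f} ms")
    for setting, latency in REPORTED_LATENCY_MS.items():
        slack = headroom_ms(latency, fps)
        print(f"{setting}: latency {latency} ms, headroom {slack:.1f} ms")
```

Under these assumptions, a 30 FPS stream allows roughly 33.3 ms per frame, so both reported settings fit within the budget, leaving about 6.7 ms of slack for NFE=1 and about 3.0 ms for NFE=4.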

Supplementary Material: zip

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 11662
