Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance

ICLR 2026 Conference Submission 16326 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: diffusion models, flow-based generative models, video-to-audio synthesis, multimodal learning
TL;DR: Step-by-step video-to-audio synthesis via incremental generation of missing sound events without the need for specialized training datasets.
Abstract: We propose a step-by-step video-to-audio (V2A) generation method for finer controllability over the generation process and more realistic audio synthesis. Inspired by traditional Foley workflows, our approach aims to comprehensively capture all sound events induced by a video through the incremental generation of missing sound events. To avoid the need for costly multi-reference video–audio datasets, each generation step is formulated as a negatively guided V2A process that discourages duplication of existing sounds. The guidance model is trained by finetuning a pre-trained V2A model on audio pairs from adjacent segments of the same video, allowing training with standard single-reference audiovisual datasets that are easily accessible. Objective and subjective evaluations show that our method improves the separability of generated sounds at each step and enhances the overall quality of the final composite audio, outperforming existing baselines.
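The sampling procedure described in the abstract can be illustrated with a short sketch. This is a minimal, hypothetical example of negatively guided flow-based sampling, assuming a video-conditioned velocity predictor `model` and a finetuned guidance model `guidance_model` conditioned on the audio generated in previous steps; these interfaces and all hyperparameters are illustrative assumptions, not the authors' released implementation.

```python
import torch

@torch.no_grad()
def guided_velocity(model, guidance_model, x_t, t, video_feats, existing_audio,
                    cfg_scale=4.0, neg_scale=1.0):
    """Combine video-conditional, unconditional, and negative-audio velocities.

    All model interfaces here are hypothetical stand-ins for the paper's
    pre-trained V2A backbone and its finetuned guidance model.
    """
    v_cond = model(x_t, t, video=video_feats)              # follow the video condition
    v_uncond = model(x_t, t, video=None)                   # unconditional baseline
    v_dup = guidance_model(x_t, t, audio=existing_audio)   # direction that duplicates existing sounds

    # Classifier-free guidance toward the video, plus a push away from
    # regenerating sound events already present in the composite audio.
    return v_uncond + cfg_scale * (v_cond - v_uncond) - neg_scale * (v_dup - v_uncond)


def sample_new_event(model, guidance_model, video_feats, existing_audio,
                     shape=(1, 16, 256), num_steps=50):
    """Euler integration of the guided flow from noise (t=0) to an audio latent (t=1)."""
    x = torch.randn(shape)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0],), i * dt)
        x = x + dt * guided_velocity(model, guidance_model, x, t,
                                     video_feats, existing_audio)
    return x  # latent of the newly added sound event; decode and mix with the existing audio
```

In this reading, each generation step repeats the loop above with the running mix as `existing_audio`, so the negative term discourages duplicating sounds produced in earlier steps while the positive term keeps the new event aligned with the video.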
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16326