OpenFoley: Open-Set Video-to-Audio Generation with Modality-Aware Masking and Flows

08 Sept 2025 (modified: 26 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: video-to-audio generation, video-audio learning
Abstract: Video-to-audio generation has emerged as a promising frontier for enriching multimodal understanding and synthesis. However, most existing approaches operate under closed-set assumptions, restricting training and evaluation to predefined categories and limiting generalization in open-world scenarios. Prior methods primarily rely on pre-trained vision-language or audio-language encoders such as CLIP and CLAP, overlooking the strong inherent video–audio correspondence that can directly guide cross-modal grounding. In this work, we present OpenFoley, a novel framework for open-set video-to-audio generation that enforces semantic fidelity and rhythmic synchronization across modalities. Our approach introduces a modality-aware dynamic masking strategy, where audio segments are reconstructed from masked video frames and vice versa, enabling the model to capture fine-grained temporal alignment without relying solely on external encoders. Furthermore, we design a generalized masked flow-based module that conditions generation on selectively sampled video frames, significantly improving efficiency and fidelity while preserving cross-modal coherence. Comprehensive experiments on VGGSound and a newly curated open-set benchmark demonstrate that OpenFoley consistently outperforms state-of-the-art baselines in both objective and perceptual metrics, achieving superior Fréchet Audio Distance (FAD) and Kullback–Leibler (KL) divergence scores. The project page can be found at: https://openfoley.github.io.
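The abstract describes a modality-aware dynamic masking strategy in which a temporal span is hidden in one modality and reconstructed from the other. The sketch below is one plausible reading of that idea, not the authors' implementation; the function name, tensor shapes, the zero-fill choice, and the mask_ratio parameter are all illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code): mask a contiguous temporal
# span in each modality so the model must recover it from the other modality.
import torch

def modality_aware_mask(video_feats, audio_feats, mask_ratio=0.3):
    """video_feats: (B, T, Dv) frame features; audio_feats: (B, T, Da) audio latents.
    Returns masked copies plus boolean masks marking the hidden time steps."""
    B, T, _ = video_feats.shape
    n_mask = max(1, int(T * mask_ratio))

    masked_video = video_feats.clone()
    masked_audio = audio_feats.clone()
    video_mask = torch.zeros(B, T, dtype=torch.bool)
    audio_mask = torch.zeros(B, T, dtype=torch.bool)

    for b in range(B):
        # Hide a contiguous video span (to be reconstructed from audio and the
        # surrounding frames) and, independently, a contiguous audio span.
        v_start = torch.randint(0, T - n_mask + 1, (1,)).item()
        a_start = torch.randint(0, T - n_mask + 1, (1,)).item()
        video_mask[b, v_start:v_start + n_mask] = True
        audio_mask[b, a_start:a_start + n_mask] = True

    masked_video[video_mask] = 0.0  # zero-fill; a learned mask token is another common choice
    masked_audio[audio_mask] = 0.0
    return masked_video, masked_audio, video_mask, audio_mask
```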
Primary Area: generative models
Submission Number: 3175