FoleyGenEx: Unified Video-to-Audio Generation with Multi-Modal Control, Temporal Alignment, and Semantic Precision

ICLR 2026 Conference Submission9395 Authors

17 Sept 2025 (modified: 27 Nov 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Video-to-audio, multi-modal controlled audio generation, style transfer, text-to-audio
TL;DR: Our framework FoleyGenEx integrates conditional injection, multi-modal dynamic masking, and a novel adverb-based data augmentation. Together, these enable diverse, fine-grained controllable audio generation across six distinct multi-modal control types.
Abstract: We introduce FoleyGenEx, a unified framework for video-to-audio (VTA) generation that integrates multi-modal control, frame-level temporal alignment, and fine-grained semantic expressivity, enabling synchronized, versatile, and expressive audio synthesis across diverse tasks. Existing VTA methods either offer multi-modal control with weak temporal alignment or achieve strong alignment while lacking reference audio conditioning and semantic precision. FoleyGenEx bridges this gap through three key innovations: a conditional injection mechanism enabling audio-controlled VTA and Foley extension, a multi-modal dynamic masking strategy preserving synchronization during multi-modal training, and an adverb-based data augmentation algorithm leveraging signal processing and large language models to enrich audio representations and textual supervision with nuanced semantic cues. Experiments on AudioCaps, VGGSound, and Greatest Hits show that FoleyGenEx delivers competitive performance in controllable VTA generation, achieving strong temporal fidelity, versatile multi-modal control, and fine-grained semantic precision compared to existing methods. Demo samples are available at https://foleygenex.github.io/FoleyGenEx.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 9395