Abstract: Video-driven audio synthesis aims to generate synchronized and contextually appropriate audio from visual content, with applications in multimedia, virtual reality, and film production. Existing methods often rely solely on visual cues, yielding audio that lacks synchronization and semantic alignment. To address these challenges, we introduce a novel video-guided audio synthesis method, termed~\textit{AvSyncDiff}. Unlike traditional approaches, AvSyncDiff leverages both visual and textual inputs, along with an optional audio prompt, to achieve precise control over audio generation and to enhance the quality and realism of the synthesized audio. Furthermore, we propose a Gaussian Mixture Diffusion Search (GMDS) algorithm, a test-time scaling strategy inspired by advances in the text-to-image domain. GMDS employs a dual-scale sampling mechanism to adaptively explore the latent space, balancing local exploitation and global exploration through a combination of small and large step sizes. Experimental results demonstrate that AvSyncDiff significantly outperforms state-of-the-art methods in both quantitative metrics and qualitative evaluations, showcasing its potential for diverse applications in multimedia and beyond.
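The abstract does not spell out GMDS beyond the dual-scale idea, so the following is a minimal illustrative sketch of that idea only: at each iteration, candidate latents are proposed with both a small step size (local exploitation) and a large step size (global exploration), and the best-scoring candidate is kept. The function and parameter names (`gmds_search`, `score_fn`, `sigma_local`, `sigma_global`) and the greedy acceptance rule are assumptions for illustration; in the actual method the score would presumably measure audio-visual synchronization and semantic alignment, and the perturbations would act on diffusion latents during sampling.

```python
import numpy as np


def gmds_search(score_fn, z0, n_steps=50, n_local=4, n_global=4,
                sigma_local=0.05, sigma_global=0.5, rng=None):
    """Illustrative dual-scale latent search (assumed form of GMDS).

    At each step, perturb the current latent with small-sigma (local) and
    large-sigma (global) Gaussian noise, then keep the highest-scoring
    candidate if it improves on the current best.
    """
    rng = np.random.default_rng() if rng is None else rng
    z, best = z0, score_fn(z0)
    for _ in range(n_steps):
        # Gaussian-mixture proposal: a batch of small steps and a batch of large steps.
        cands = [z + sigma_local * rng.standard_normal(z.shape) for _ in range(n_local)]
        cands += [z + sigma_global * rng.standard_normal(z.shape) for _ in range(n_global)]
        scores = [score_fn(c) for c in cands]
        i = int(np.argmax(scores))
        if scores[i] > best:  # greedy acceptance (assumed)
            z, best = cands[i], scores[i]
    return z, best


if __name__ == "__main__":
    # Toy usage: maximize a synthetic "alignment" score over a 16-dim latent.
    target = np.ones(16)
    score = lambda z: -float(np.sum((z - target) ** 2))
    z_star, s = gmds_search(score, np.zeros(16))
    print(f"final score: {s:.3f}")
```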
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Text and video to audio generation
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 5690