Abstract: Video-driven audio synthesis aims to generate synchronized and contextually appropriate audio from visual content, with applications in multimedia, virtual reality, and film production. Existing methods often rely solely on visual cues, yielding audio that lacks synchronization and semantic alignment. To address these challenges, we introduce a novel video-guided audio synthesis method, termed~\textit{AvSyncDiff}. Unlike traditional approaches, AvSyncDiff leverages both visual and textual inputs, along with an optional audio prompt, to achieve precise control over audio generation and to enhance the quality and realism of the synthesized audio. Furthermore, we propose a Gaussian Mixture Diffusion Search (GMDS) algorithm, a test-time scaling strategy inspired by advances in the text-to-image domain. GMDS employs a dual-scale sampling mechanism to adaptively explore the latent space, balancing local exploitation and global exploration through a combination of small and large step sizes. Experimental results demonstrate that AvSyncDiff significantly outperforms state-of-the-art methods in both quantitative metrics and qualitative evaluations, showcasing its potential for diverse applications in multimedia and beyond.
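The abstract does not spell out GMDS beyond the dual-scale idea, so the following is a minimal illustrative sketch of that idea only: at each iteration, candidate latents are proposed with both a small step size (local exploitation) and a large step size (global exploration), and the best-scoring candidate is kept. The function and parameter names (`gmds_search`, `score_fn`, `sigma_local`, `sigma_global`) and the greedy acceptance rule are assumptions for illustration; in the actual method the score would presumably measure audio-visual synchronization and semantic alignment, and the perturbations would act on diffusion latents during sampling.

```python
import numpy as np


def gmds_search(score_fn, z0, n_steps=50, n_local=4, n_global=4,
                sigma_local=0.05, sigma_global=0.5, rng=None):
    """Illustrative dual-scale latent search (assumed form of GMDS).

    At each step, perturb the current latent with small-sigma (local) and
    large-sigma (global) Gaussian noise, then keep the highest-scoring
    candidate if it improves on the current best.
    """
    rng = np.random.default_rng() if rng is None else rng
    z, best = z0, score_fn(z0)
    for _ in range(n_steps):
        # Gaussian-mixture proposal: a batch of small steps and a batch of large steps.
        cands = [z + sigma_local * rng.standard_normal(z.shape) for _ in range(n_local)]
        cands += [z + sigma_global * rng.standard_normal(z.shape) for _ in range(n_global)]
        scores = [score_fn(c) for c in cands]
        i = int(np.argmax(scores))
        if scores[i] > best:  # greedy acceptance (assumed)
            z, best = cands[i], scores[i]
    return z, best


if __name__ == "__main__":
    # Toy usage: maximize a synthetic "alignment" score over a 16-dim latent.
    target = np.ones(16)
    score = lambda z: -float(np.sum((z - target) ** 2))
    z_star, s = gmds_search(score, np.zeros(16))
    print(f"final score: {s:.3f}")
```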
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Text and video to audio generation
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 5690