TiVA: Time-Aligned Video-to-Audio Generation

Published: 20 Jul 2024, Last Modified: 06 Aug 2024 · MM 2024 Oral · CC BY 4.0
Abstract: Video-to-audio generation, which aims to generate high-quality audio for silent videos with semantic similarity and temporal synchronization, is crucial for autonomous video editing and post-processing. However, most existing methods focus mainly on matching the semantics of the visual and acoustic modalities while considering their temporal alignment only at a coarse granularity, and thus fail to achieve precise synchronization. In this study, we propose a novel time-aligned video-to-audio framework, called TiVA, to achieve semantic matching and temporal synchronization jointly when generating audio. Given a silent video, our method encodes its visual semantics and predicts an audio layout separately. Then, using the semantic latent embeddings and the predicted audio layout as conditions, it learns a latent diffusion-based audio generator. Comprehensive objective and subjective experiments demonstrate that our method consistently outperforms state-of-the-art methods on both semantic matching and temporal synchronization.
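The abstract's pipeline (per-frame semantic encoding, a separately predicted audio layout, and a generator conditioned on both) can be sketched as toy code. This is a minimal illustration, not the paper's implementation: the encoder, layout predictor, and "generator" below are stand-ins with made-up dimensions, and the diffusion model is replaced by a single conditioned sampling step so the data flow and time alignment are visible.

```python
import numpy as np

# Hypothetical dimensions for illustration only (not from the paper).
T_FRAMES, D_SEM, T_AUDIO = 16, 128, 64

def encode_visual_semantics(frames, rng):
    # Stand-in for a video encoder: one embedding per frame.
    # frames: (T_FRAMES, H, W, 3) -> (T_FRAMES, D_SEM)
    flat = frames.reshape(frames.shape[0], -1)
    proj = rng.standard_normal((flat.shape[1], D_SEM)) / np.sqrt(flat.shape[1])
    return flat @ proj

def predict_audio_layout(frames):
    # Stand-in layout predictor: a coarse temporal energy envelope
    # from frame-to-frame pixel change, resampled to audio time steps.
    diffs = np.abs(np.diff(frames.astype(np.float64), axis=0)).mean(axis=(1, 2, 3))
    diffs = np.concatenate([[diffs[0]], diffs])        # (T_FRAMES,)
    t_src = np.linspace(0, 1, len(diffs))
    t_dst = np.linspace(0, 1, T_AUDIO)
    return np.interp(t_dst, t_src, diffs)              # (T_AUDIO,)

def generate_audio_latent(semantics, layout, rng):
    # Stand-in for the latent diffusion generator: a noise latent is
    # combined with the global semantic condition, and each audio time
    # step is gated by the predicted layout for temporal alignment.
    latent = rng.standard_normal((T_AUDIO, D_SEM))
    cond = semantics.mean(axis=0, keepdims=True)       # (1, D_SEM)
    return (latent + cond) * layout[:, None]           # (T_AUDIO, D_SEM)

rng = np.random.default_rng(0)
video = rng.random((T_FRAMES, 8, 8, 3))                # dummy silent video
sem = encode_visual_semantics(video, rng)
lay = predict_audio_layout(video)
audio_latent = generate_audio_latent(sem, lay, rng)
print(audio_latent.shape)  # (64, 128)
```

The key design point the sketch mirrors is that the layout is a time-indexed signal applied per audio step, separate from the global semantic condition, which is what lets synchronization be controlled at a finer granularity than semantics-only conditioning.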
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Experience] Multimedia Applications, [Content] Media Interpretation
Relevance To Conference: We propose a novel time-aligned framework that employs a simple yet effective tempo layout to enhance video-to-audio cross-modality generation.
Supplementary Material: zip
Submission Number: 2265