Keywords: Audio-to-Video, temporal tokens
TL;DR: Adaptation of text-to-video models to audio-to-video by learning the mapping from audio to tokens
Abstract: We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. For this task, the videos must be aligned both globally and temporally with the input audio: globally, the input audio is semantically associated with the entire output video, and temporally, each segment of the input audio is associated with a corresponding segment of that video. We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model. The proposed method is based on a lightweight adaptor network, which learns to map the audio-based representation to the input representation expected by the text-to-video generation model. As such, it also enables video generation conditioned on text, on audio, and, for the first time, on both text and audio. We extensively validate our method on three datasets that exhibit significant semantic diversity of audio-video samples. We further propose a novel evaluation metric (AV-Align) to assess the alignment of generated videos with input audio samples. AV-Align is based on detecting and comparing energy peaks in both modalities. Compared to recent state-of-the-art approaches, our method generates videos that are better aligned with the input sound, with respect to both content and the temporal axis. We also show that videos produced by our method present higher visual quality and are more diverse.
Submission Number: 6
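The abstract says only that AV-Align detects and compares energy peaks in the two modalities. As a rough illustration of that idea, and not the paper's actual metric, the sketch below assumes that each modality has already been reduced to a per-frame energy curve (for example, audio loudness and video motion magnitude), finds local maxima in each curve, and scores their overlap with an intersection-over-union ratio. The function names `find_peaks` and `peak_alignment_score`, the tolerance `window`, and the greedy matching rule are all hypothetical choices made for this example.

```python
from typing import List, Sequence


def find_peaks(energy: Sequence[float], threshold: float = 0.0) -> List[int]:
    """Return indices of local maxima in `energy` that exceed `threshold`.

    A flat plateau is reported once, at its first sample.
    """
    peaks = []
    for i in range(1, len(energy) - 1):
        rises = energy[i] > energy[i - 1]
        does_not_rise_after = energy[i] >= energy[i + 1]
        if energy[i] > threshold and rises and does_not_rise_after:
            peaks.append(i)
    return peaks


def peak_alignment_score(
    audio_peaks: Sequence[int], video_peaks: Sequence[int], window: int = 1
) -> float:
    """IoU-style overlap between two peak lists (hypothetical scoring rule).

    An audio peak matches the nearest unused video peak that lies within
    `window` frames. The score is matches / (total peaks - matches).
    """
    unused = list(video_peaks)
    matches = 0
    for a in audio_peaks:
        candidates = [v for v in unused if abs(v - a) <= window]
        if candidates:
            nearest = min(candidates, key=lambda v: abs(v - a))
            unused.remove(nearest)  # each video peak is matched at most once
            matches += 1
    union = len(audio_peaks) + len(video_peaks) - matches
    return matches / union if union > 0 else 0.0


# Toy per-frame energy curves (made-up values, for illustration only).
audio_energy = [0, 1, 0, 0, 2, 0, 0, 0, 3, 0]
video_energy = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0]

audio_peaks = find_peaks(audio_energy)
video_peaks = find_peaks(video_energy)
score = peak_alignment_score(audio_peaks, video_peaks, window=1)
```

In this toy case, two of the three audio peaks have a video peak within one frame, so the score is below 1. A wider `window` tolerates looser synchronization, at the cost of rewarding videos whose motion only roughly follows the sound.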