Keywords: multi-modal learning, audio-visual learning, multi-modal large-language-model, text-guided video-to-audio generation, video-to-audio captioning
TL;DR: A novel multi-modal generation framework for text-guided video-to-audio generation and video-to-audio captioning.
Abstract: The content of visual and audio scenes is multi-faceted, such that a video stream can be paired with various audio streams and vice versa. Therefore, in the video-to-audio generation task, it is imperative to introduce steering approaches for controlling the generated audio. While video-to-audio generation is a well-established generative task, existing methods lack such controllability. In this work, we propose VATT, a
multi-modal generative framework that takes a video and an optional text prompt
as input, and generates audio and an optional textual description (caption) of the audio. Such a framework has two unique advantages: i) the video-to-audio generation process can be refined and controlled via text, which complements the context of the visual information, and ii) the model can suggest what audio to generate
for the video by generating audio captions. VATT consists of two key modules:
VATT Converter, an instruction-tuned LLM with a projection layer that maps video features into the LLM vector space, and VATT Audio, a bi-directional transformer that generates audio tokens from visual frames and an optional text prompt using iterative parallel decoding. The audio tokens are then converted into a waveform by a pretrained neural codec. Our experiments show that when VATT is compared to existing
video-to-audio generation methods on the VGGSound audio-visual dataset using objective metrics, it achieves competitive performance when the audio caption is not provided. When the audio caption is provided as a prompt, VATT achieves even more refined performance (with the lowest KLD score of 1.41). Furthermore, subjective studies asking participants to choose the most compatible generated audio for a given silent video show that, on average, the audio generated by VATT Audio was preferred over audio generated by existing methods. VATT
enables controllable video-to-audio generation through text as well as suggesting
text prompts for videos through audio captions, unlocking novel applications such
as text-guided video-to-audio generation and video-to-audio captioning.
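Below is a minimal schematic sketch, in PyTorch, of the pipeline described in the abstract: a projection layer that maps video features into the LLM vector space, and a bi-directional transformer that fills in audio tokens via iterative parallel (MaskGIT-style) decoding. All module names, dimensions, and the confidence-based unmasking schedule are illustrative assumptions, not the authors' implementation; the resulting discrete tokens would then be decoded to a waveform by a pretrained neural codec.

```python
import torch
import torch.nn as nn


class VideoToLLMProjector(nn.Module):
    """Projection layer mapping per-frame video features into the LLM vector
    space (the role the abstract assigns to VATT Converter's projection)."""

    def __init__(self, video_dim=768, llm_dim=512):
        super().__init__()
        self.proj = nn.Linear(video_dim, llm_dim)

    def forward(self, video_feats):            # (B, T_frames, video_dim)
        return self.proj(video_feats)          # (B, T_frames, llm_dim)


class BidirectionalAudioTokenModel(nn.Module):
    """Bi-directional transformer predicting audio tokens conditioned on a
    video/text context sequence (schematic stand-in for VATT Audio)."""

    def __init__(self, vocab_size=1024, dim=512, n_layers=4, n_heads=8, seq_len=256):
        super().__init__()
        self.mask_id = vocab_size                           # extra [MASK] token id
        self.tok_emb = nn.Embedding(vocab_size + 1, dim)
        self.pos_emb = nn.Parameter(torch.zeros(1, seq_len, dim))
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens, context):                     # tokens: (B, L); context: (B, T_ctx, dim)
        x = self.tok_emb(tokens) + self.pos_emb[:, : tokens.size(1)]
        h = self.encoder(torch.cat([context, x], dim=1))    # full (non-causal) attention
        return self.head(h[:, context.size(1):])            # (B, L, vocab_size) logits


@torch.no_grad()
def iterative_parallel_decode(model, context, seq_len=256, steps=8):
    """Iterative parallel decoding: start from all-[MASK] tokens and, at each
    step, commit the most confident predictions among still-masked positions."""
    B = context.size(0)
    tokens = torch.full((B, seq_len), model.mask_id, dtype=torch.long, device=context.device)
    per_step = seq_len // steps
    for step in range(steps):
        logits = model(tokens, context)
        probs, preds = logits.softmax(dim=-1).max(dim=-1)
        still_masked = tokens == model.mask_id
        n_unmask = int(still_masked[0].sum()) if step == steps - 1 else per_step
        conf = probs.masked_fill(~still_masked, float("-inf"))
        idx = conf.topk(n_unmask, dim=-1).indices            # most confident masked slots
        tokens.scatter_(1, idx, preds.gather(1, idx))
    return tokens  # discrete audio tokens; a pretrained neural codec would decode these to audio


# Toy usage with random features; shapes and sizes are illustrative only.
projector = VideoToLLMProjector()
audio_model = BidirectionalAudioTokenModel()
context = projector(torch.randn(1, 16, 768))                 # 16 video frames
audio_tokens = iterative_parallel_decode(audio_model, context)
print(audio_tokens.shape)                                     # torch.Size([1, 256])
```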
Primary Area: Speech and audio
Flagged For Ethics Review: true
Submission Number: 1915