SpecMaskFoley: Efficient Yet Effective Synchronized Video-to-Audio Synthesis via Pretraining and ControlNet
Keywords: Foley, Video-to-audio, ControlNet, MaskGIT, Discrete Diffusion, Audio Generation
TL;DR: High-performance video-to-audio model with the following features: (1) Researcher-friendly: simple conditioning mechanism. (2) Developer-friendly: single-GPU training. (3) User-friendly: few-step generation and fast inference speed.
Abstract: Foley synthesis, a task of wide interest, aims to synthesize high-quality audio that is both semantically and temporally aligned with video frames.
To avoid the non-trivial task of training audio generative models from scratch, adapting pretrained audio generative models for video-synchronized foley synthesis presents an attractive direction.
ControlNet, a method for adding fine-grained controls to pretrained generative models, has been applied to foley synthesis, but its use has been limited to handcrafted human-readable temporal conditions.
In contrast, from-scratch models have achieved success by leveraging high-dimensional deep features extracted using pretrained video encoders.
We have observed a performance gap between ControlNet-based and from-scratch foley models.
To narrow this gap, we propose SpecMaskFoley, a method that steers the pretrained SpecMaskGIT model toward video-synchronized foley synthesis via ControlNet.
To unlock the potential of a single ControlNet branch, we resolve the discrepancy between the temporal video features and the time-frequency nature of the pretrained SpecMaskGIT via a frequency-aware temporal feature aligner,
eliminating the need for the complicated conditioning mechanisms widely used in prior work.
Evaluations on a common foley synthesis benchmark demonstrate that SpecMaskFoley can even outperform strong from-scratch baselines, substantially advancing the development of ControlNet-based foley synthesis models.
Demo samples are uploaded as supplementary files.
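The frequency-aware temporal feature aligner is only described at a high level in the abstract. Below is a minimal PyTorch sketch of one plausible design, assuming video features of shape (batch, time, channels) and a spectrogram-token grid of shape (batch, freq, time, channels); the class name, the linear temporal interpolation, and the learned per-frequency embedding are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a frequency-aware temporal feature aligner.
# All names, shapes, and design choices here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrequencyAwareTemporalAligner(nn.Module):
    """Maps temporal video features (B, T_video, D_video) onto the
    time-frequency token grid (B, F_bins, T_audio, D_model) expected
    by a spectrogram-token model such as SpecMaskGIT."""

    def __init__(self, d_video: int, d_model: int, f_bins: int):
        super().__init__()
        self.proj = nn.Linear(d_video, d_model)  # channel projection
        # Learned per-frequency offset that makes the purely temporal
        # condition frequency-aware.
        self.freq_embed = nn.Parameter(torch.zeros(f_bins, d_model))

    def forward(self, video_feats: torch.Tensor, t_audio: int) -> torch.Tensor:
        # (B, T_video, D_video) -> (B, D_video, T_video) for 1D interpolation.
        x = video_feats.transpose(1, 2)
        # Resample the video timeline to the audio-token timeline.
        x = F.interpolate(x, size=t_audio, mode="linear", align_corners=False)
        x = x.transpose(1, 2)        # (B, T_audio, D_video)
        x = self.proj(x)             # (B, T_audio, D_model)
        # Broadcast along frequency and add the frequency embedding.
        x = x.unsqueeze(1) + self.freq_embed[None, :, None, :]
        return x                     # (B, F_bins, T_audio, D_model)

# Example: 8 video frames at 1024-d mapped to a 16x64 token grid at 512-d.
aligner = FrequencyAwareTemporalAligner(d_video=1024, d_model=512, f_bins=16)
cond = aligner(torch.randn(2, 8, 1024), t_audio=64)  # -> (2, 16, 64, 512)
```

The output grid could then be consumed by a single ControlNet branch, matching the abstract's point that no further conditioning machinery is required; how the condition is actually injected is not specified here.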
Supplementary Material: zip
Submission Number: 2