SpecMaskFoley: Efficient Yet Effective Synchronized Video-to-Audio Synthesis via Pretraining and ControlNet
Keywords: Foley, Video-to-audio, ControlNet, MaskGIT, Discrete Diffusion, Audio Generation
TL;DR: High-performance video-to-audio model with the following features: (1) Researcher-friendly: simple conditioning mechanism. (2) Developer-friendly: single-GPU training. (3) User-friendly: few-step generation and fast inference speed.
Abstract: Foley synthesis, a task of wide interest, aims to synthesize high-quality audio that is both semantically and temporally aligned with video frames.
To avoid the non-trivial task of training audio generative models from scratch, adapting pretrained audio generative models for video-synchronized foley synthesis presents an attractive direction.
ControlNet, a method for adding fine-grained controls to pretrained generative models, has been applied to foley synthesis, but its use has been limited to handcrafted human-readable temporal conditions.
In contrast, from-scratch models have achieved success by leveraging high-dimensional deep features extracted using pretrained video encoders.
We have observed a performance gap between ControlNet-based and from-scratch foley models.
To narrow this gap, we propose SpecMaskFoley, a method that steers the pretrained SpecMaskGIT model toward video-synchronized foley synthesis via ControlNet.
To unlock the potential of a single ControlNet branch, we resolve the discrepancy between the temporal video features and the time-frequency nature of the pretrained SpecMaskGIT via a frequency-aware temporal feature aligner,
eliminating the need for the complicated conditioning mechanisms widely used in prior work.
Evaluations on a common foley synthesis benchmark demonstrate that SpecMaskFoley can even outperform strong from-scratch baselines, substantially advancing the development of ControlNet-based foley synthesis models.
Demo samples are uploaded as supplementary files.
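The frequency-aware temporal feature aligner is only described at a high level in the abstract. Below is a minimal PyTorch sketch of one plausible design, assuming video features of shape (batch, time, channels) and a spectrogram-token grid of shape (batch, freq, time, channels); the class name, the linear temporal interpolation, and the learned per-frequency embedding are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a frequency-aware temporal feature aligner.
# All names, shapes, and design choices here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrequencyAwareTemporalAligner(nn.Module):
    """Maps temporal video features (B, T_video, D_video) onto the
    time-frequency token grid (B, F_bins, T_audio, D_model) expected
    by a spectrogram-token model such as SpecMaskGIT."""

    def __init__(self, d_video: int, d_model: int, f_bins: int):
        super().__init__()
        self.proj = nn.Linear(d_video, d_model)  # channel projection
        # Learned per-frequency offset that makes the purely temporal
        # condition frequency-aware.
        self.freq_embed = nn.Parameter(torch.zeros(f_bins, d_model))

    def forward(self, video_feats: torch.Tensor, t_audio: int) -> torch.Tensor:
        # (B, T_video, D_video) -> (B, D_video, T_video) for 1D interpolation.
        x = video_feats.transpose(1, 2)
        # Resample the video timeline to the audio-token timeline.
        x = F.interpolate(x, size=t_audio, mode="linear", align_corners=False)
        x = x.transpose(1, 2)        # (B, T_audio, D_video)
        x = self.proj(x)             # (B, T_audio, D_model)
        # Broadcast along frequency and add the frequency embedding.
        x = x.unsqueeze(1) + self.freq_embed[None, :, None, :]
        return x                     # (B, F_bins, T_audio, D_model)

# Example: 8 video frames at 1024-d mapped to a 16x64 token grid at 512-d.
aligner = FrequencyAwareTemporalAligner(d_video=1024, d_model=512, f_bins=16)
cond = aligner(torch.randn(2, 8, 1024), t_audio=64)  # -> (2, 16, 64, 512)
```

The output grid could then be consumed by a single ControlNet branch, matching the abstract's point that no further conditioning machinery is required; how the condition is actually injected is not specified here.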
Supplementary Material: zip
Submission Number: 2