HiFi-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation
Keywords: Video-to-audio, Audio Generation, Foley Generation, Multi-modal, Representation Alignment
TL;DR: HiFi-Foley is a text-video-to-audio model that generates high-fidelity, precisely synchronized Foley audio for videos, outperforming state-of-the-art methods.
Abstract: Recent advances in video generation produce visually realistic content, yet the absence of synchronized audio severely compromises immersion. To address key challenges in video-to-audio generation, including multimodal data scarcity, modal semantic response imbalance, and limited audio quality in existing methods, we propose HiFi-Foley, an end-to-end text-video-to-audio framework that synthesizes high-fidelity audio precisely aligned with visual dynamics and semantic context. Our approach incorporates three core innovations: (1) a novel multimodal diffusion transformer that addresses semantic response imbalance between video and text modalities through dual-stream audio-video fusion via joint attention and balanced textual semantic injection via cross-attention; (2) a representation alignment training strategy that employs self-supervised audio features to guide latent diffusion training, thereby improving audio quality and semantic consistency; (3) a scalable data pipeline leveraging open-source tools for cleaning raw data and constructing training datasets. Extensive evaluations demonstrate that HiFi-Foley achieves state-of-the-art performance across audio fidelity, visual-semantic alignment, temporal alignment, and distribution matching.
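The representation alignment strategy in innovation (2) — using frozen self-supervised audio features to guide latent diffusion training — can be sketched as a REPA-style auxiliary loss: project intermediate diffusion-transformer hidden states into the self-supervised feature space and maximize their cosine similarity with the frozen encoder's features. This is a minimal illustration under assumed names and shapes (`ReprAlignLoss`, `hidden_dim`, `ssl_dim`), not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReprAlignLoss(nn.Module):
    """Hypothetical representation-alignment loss (REPA-style sketch).

    Aligns intermediate diffusion hidden states with features from a
    frozen self-supervised audio encoder via per-token cosine similarity.
    """

    def __init__(self, hidden_dim: int, ssl_dim: int):
        super().__init__()
        # Lightweight projector from the diffusion model's hidden space
        # into the self-supervised feature space (assumed design choice).
        self.proj = nn.Sequential(
            nn.Linear(hidden_dim, ssl_dim),
            nn.SiLU(),
            nn.Linear(ssl_dim, ssl_dim),
        )

    def forward(self, hidden: torch.Tensor, ssl_feat: torch.Tensor) -> torch.Tensor:
        # hidden:   (B, T, hidden_dim) from an intermediate transformer layer
        # ssl_feat: (B, T, ssl_dim) from the frozen self-supervised encoder
        z = self.proj(hidden)
        # Maximizing cosine similarity == minimizing (1 - similarity).
        return (1.0 - F.cosine_similarity(z, ssl_feat, dim=-1)).mean()

# Toy usage with random tensors standing in for real features.
loss_fn = ReprAlignLoss(hidden_dim=64, ssl_dim=32)
hidden = torch.randn(2, 10, 64)
ssl_feat = torch.randn(2, 10, 32)
loss = loss_fn(hidden, ssl_feat)  # scalar in [0, 2]
```

In training, this loss would be added (with a weighting coefficient) to the standard diffusion objective, with the self-supervised encoder kept frozen so only the projector and diffusion backbone receive gradients.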
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 18324