HiFi-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation
Keywords: Video-to-audio, Audio Generation, Foley Generation, Multi-modal, Representation Alignment
TL;DR: HiFi-Foley is a text-video-to-audio model that generates high-fidelity, precisely synchronized Foley audio for videos, outperforming state-of-the-art methods.
Abstract: Recent advances in video generation produce visually realistic content, yet the absence of synchronized audio severely compromises immersion. To address key challenges in video-to-audio generation, including multimodal data scarcity, modal semantic response imbalance, and limited audio quality in existing methods, we propose HiFi-Foley, an end-to-end text-video-to-audio framework that synthesizes high-fidelity audio precisely aligned with visual dynamics and semantic context. Our approach incorporates three core innovations: (1) a novel multimodal diffusion transformer that addresses semantic response imbalance between video and text modalities through dual-stream audio-video fusion via joint attention and balanced textual semantic injection via cross-attention; (2) a representation alignment training strategy that employs self-supervised audio features to guide latent diffusion training, thereby improving audio quality and semantic consistency; (3) a scalable data pipeline leveraging open-source tools for cleaning raw data and constructing training datasets. Extensive evaluations demonstrate that HiFi-Foley achieves state-of-the-art performance across audio fidelity, visual-semantic alignment, temporal alignment, and distribution matching.
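The representation alignment strategy in innovation (2) — using frozen self-supervised audio features to guide latent diffusion training — can be sketched as a REPA-style auxiliary loss: project intermediate diffusion-transformer hidden states into the self-supervised feature space and maximize their cosine similarity with the frozen encoder's features. This is a minimal illustration under assumed names and shapes (`ReprAlignLoss`, `hidden_dim`, `ssl_dim`), not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReprAlignLoss(nn.Module):
    """Hypothetical representation-alignment loss (REPA-style sketch).

    Aligns intermediate diffusion hidden states with features from a
    frozen self-supervised audio encoder via per-token cosine similarity.
    """

    def __init__(self, hidden_dim: int, ssl_dim: int):
        super().__init__()
        # Lightweight projector from the diffusion model's hidden space
        # into the self-supervised feature space (assumed design choice).
        self.proj = nn.Sequential(
            nn.Linear(hidden_dim, ssl_dim),
            nn.SiLU(),
            nn.Linear(ssl_dim, ssl_dim),
        )

    def forward(self, hidden: torch.Tensor, ssl_feat: torch.Tensor) -> torch.Tensor:
        # hidden:   (B, T, hidden_dim) from an intermediate transformer layer
        # ssl_feat: (B, T, ssl_dim) from the frozen self-supervised encoder
        z = self.proj(hidden)
        # Maximizing cosine similarity == minimizing (1 - similarity).
        return (1.0 - F.cosine_similarity(z, ssl_feat, dim=-1)).mean()

# Toy usage with random tensors standing in for real features.
loss_fn = ReprAlignLoss(hidden_dim=64, ssl_dim=32)
hidden = torch.randn(2, 10, 64)
ssl_feat = torch.randn(2, 10, 32)
loss = loss_fn(hidden, ssl_feat)  # scalar in [0, 2]
```

In training, this loss would be added (with a weighting coefficient) to the standard diffusion objective, with the self-supervised encoder kept frozen so only the projector and diffusion backbone receive gradients.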
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 18324