HiViBiX: Hierarchical Visually-informed Binaural Audio Generation using Ambisonics

ICLR 2026 Conference Submission 20341 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: binaural audio, hierarchical vision encoding, Ambisonics decoding
TL;DR: Multimodal mono-to-binaural generation using a hierarchical vision encoder and Ambisonics decoding.
Abstract: Binaural audio, a specialized form of stereo sound, provides depth and spatial localization for highly immersive listening experiences, making it fundamental in modern entertainment. Prior research has largely relied on visual cues either to convert mono signals directly into binaural audio or to estimate transfer functions that induce spatiality. In contrast, we introduce HiViBiX, a novel framework that redefines the audio representation by predicting first-order Ambisonics channels, which explicitly control the spatial positioning of audio components in the generated binaural signal. Unlike existing multimodal approaches that extract spatial cues exclusively from full-frame RGB images, HiViBiX incorporates a hierarchical visual encoder that jointly models local sound sources and their spatial depth together with global environmental context. This design enables richer multimodal grounding and more precise spatialization. Extensive experiments on three widely used benchmarks (FAIR-Play, Music-Stereo, and YT-Music) demonstrate that HiViBiX establishes new state-of-the-art performance for mono-to-binaural generation. Samples are available at https://hivibix.vercel.app.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 20341
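
Illustrative note: the abstract states that first-order Ambisonics channels control where audio components are placed in the output. The sketch below is not the HiViBiX model or its decoder; it only shows, under stated assumptions, how a first-order Ambisonics signal encodes direction and how a very simple decoder turns it into two channels. The function names `encode_foa` and `decode_to_stereo` are hypothetical, the channel convention is assumed to be AmbiX (ACN order W, Y, Z, X with SN3D normalization), and the decoder uses two virtual cardioid microphones facing left and right rather than the HRTF-based rendering a real binaural decoder would use.

```python
import numpy as np


def encode_foa(mono, azimuth_deg, elevation_deg=0.0):
    """Encode a mono signal as first-order Ambisonics (assumed AmbiX: ACN/SN3D).

    Azimuth is measured counter-clockwise from straight ahead, so +90 is hard left.
    Returns an array of shape (4, n_samples) in channel order W, Y, Z, X.
    """
    az = np.deg2rad(azimuth_deg)
    el = np.deg2rad(elevation_deg)
    w = mono                              # omnidirectional component
    y = mono * np.sin(az) * np.cos(el)    # left-right component
    z = mono * np.sin(el)                 # up-down component
    x = mono * np.cos(az) * np.cos(el)    # front-back component
    return np.stack([w, y, z, x])


def decode_to_stereo(foa):
    """Decode with two virtual cardioids aimed at +90 and -90 degrees azimuth.

    This is a stand-in for binaural rendering: it reproduces level differences
    only, with no HRTF filtering or interaural time differences.
    """
    w, y, _z, _x = foa
    left = 0.5 * (w + y)
    right = 0.5 * (w - y)
    return np.stack([left, right])


if __name__ == "__main__":
    t = np.linspace(0.0, 1.0, 16000, endpoint=False)
    mono = np.sin(2.0 * np.pi * 440.0 * t)
    stereo = decode_to_stereo(encode_foa(mono, azimuth_deg=90.0))
    print(stereo.shape)
```

In this sketch, changing `azimuth_deg` shifts energy between the two output channels without touching the mono content, which is the sense in which the Ambisonics channels act as an explicit spatial control.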