Differentiable Room Acoustic Rendering with Multi-View Vision Priors

Published: 07 Aug 2025, Last Modified: 07 Aug 2025, Gen4AVC Poster, CC BY 4.0
Keywords: acoustic rendering, audio visual learning, room impulse response, spatial audio
Abstract: Spatial audio is essential for immersive AR/VR applications, yet existing methods for room impulse response estimation either require dense training data or expensive physics simulation. In this work, we introduce Audio-Visual Differentiable Room Acoustic Rendering (AV-DAR), a framework that leverages visual cues extracted from multi-view images and acoustic beam tracing for physics-based room acoustic rendering. This multi-modal, physics-based, end-to-end framework is efficient, interpretable, and accurate. Experiments across six real-world environments from two datasets demonstrate that AV-DAR significantly outperforms a series of prior methods. Notably, on the Real Acoustic Field dataset, AV-DAR achieves performance comparable to models trained on ten times more data while delivering relative gains ranging from 16.6% to 50.9% when trained at the same scale.
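To make the idea of physics-based differentiable acoustic rendering concrete, below is a minimal, purely illustrative sketch and not the AV-DAR implementation: learnable surface absorption coefficients (which AV-DAR would instead predict from multi-view visual features) are fitted by gradient descent so that a room impulse response rendered from a fixed set of traced reflection paths matches a measured one. The path list, single-band absorption, and Gaussian pulse placement are simplifying assumptions made only for this toy example.

```python
# Toy sketch of differentiable RIR rendering (illustrative assumptions, not AV-DAR).
import torch

SPEED_OF_SOUND = 343.0   # m/s
SAMPLE_RATE = 16000      # Hz
RIR_LENGTH = 4000        # samples

# Hypothetical traced reflection paths: (total path length in meters, surface ids hit).
paths = [(3.2, []), (7.5, [0]), (9.1, [1]), (12.4, [0, 1]), (15.0, [0, 1, 2])]

# Learnable absorption per surface, kept in (0, 1) via a sigmoid.
absorption_logits = torch.zeros(3, requires_grad=True)

def render_rir(absorption_logits):
    """Differentiably render an RIR from fixed traced paths and surface absorption."""
    absorption = torch.sigmoid(absorption_logits)
    rir = torch.zeros(RIR_LENGTH)
    t = torch.arange(RIR_LENGTH, dtype=torch.float32)
    for length, hits in paths:
        delay = length / SPEED_OF_SOUND * SAMPLE_RATE      # arrival time in samples
        gain = 1.0 / max(length, 1e-3)                     # spherical spreading
        energy = torch.ones(())
        for s in hits:                                     # energy lost at each bounce
            energy = energy * (1.0 - absorption[s])
        # Soft (differentiable) impulse placement instead of a hard index write.
        pulse = torch.exp(-0.5 * ((t - delay) / 1.5) ** 2)
        rir = rir + gain * energy * pulse
    return rir

# Fit to a "measured" RIR, synthesized here with known absorption for the demo.
with torch.no_grad():
    target = render_rir(torch.logit(torch.tensor([0.3, 0.6, 0.1])))

opt = torch.optim.Adam([absorption_logits], lr=0.05)
for step in range(300):
    opt.zero_grad()
    loss = torch.mean((render_rir(absorption_logits) - target) ** 2)
    loss.backward()
    opt.step()

print("estimated absorption:", torch.sigmoid(absorption_logits).detach())
```

The point of the sketch is only that the rendering step stays differentiable end to end, so acoustic parameters can be optimized directly from a reconstruction loss; the actual paper couples this with acoustic beam tracing and visual priors from multi-view images.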
Supplementary Material: pdf
Submission Number: 4