Evaluating Visual-to-Echo Distillation for Binaural Depth Prediction beyond Simulations

Nazrul Ismail; Owais Ahmed Malik; Wee Hong Ong

Published: 14 Jun 2026, Last Modified: 17 Jun 2026

Venue: ICML 2026 Workshop MusIML Poster

License: CC BY 4.0

Keywords: Cross-modal Distillation, Computer Vision, Binaural audio, Depth prediction

TL;DR: We evaluate a visual-to-echo distilled binaural depth prediction network on a dataset of real recordings.

Abstract: Echo reflections encode physical cues about object distance, geometry, and surface material that are useful for spatial reasoning. Prior work has incorporated echo reflections as a modality for depth prediction, either through direct fusion or through cross-modal knowledge distillation from vision to audio, but evaluation has been confined to simulated environments such as Replica and Matterport3D, leaving real-world viability untested. In this short paper, we evaluate Visual2Echo Compositional Contrastive Learning (V2E-CCL), a knowledge distillation framework that predicts depth from binaural echoes by aligning cross-modal representations in a shared latent space, on real binaural recordings from the BatVision dataset. To our knowledge, this is the first evaluation of vision-to-echo distillation on real binaural recordings. The results indicate that the benefit of cross-modal distillation previously observed only in simulation also holds on real-world echoes. We further analyse failure modes specific to real echo capture.
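
Illustrative sketch (not from the paper): the idea of aligning echo and visual representations in a shared latent space can be pictured with a generic contrastive objective. The code below is a minimal NumPy sketch of a symmetric InfoNCE-style loss, assuming paired embeddings from a visual teacher and an echo student are already available. The function name `info_nce`, the `temperature` value, and the embedding shapes are assumptions for illustration; the abstract does not specify the V2E-CCL loss, so this should not be read as the authors' formulation.

```python
import numpy as np


def info_nce(echo_emb, visual_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss for paired (echo, visual) embeddings.

    Hypothetical helper for illustration only. Both inputs have shape
    (N, D); row i of each array is assumed to come from the same scene.
    """
    # L2-normalise so that dot products are cosine similarities.
    e = echo_emb / np.linalg.norm(echo_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix; the diagonal holds the matched pairs.
    logits = (e @ v.T) / temperature

    def diagonal_cross_entropy(z):
        # Numerically stable log-softmax over each row.
        z = z - z.max(axis=1, keepdims=True)
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        # Each row's correct "class" is its own index.
        return -np.mean(np.diag(log_prob))

    # Average the echo-to-visual and visual-to-echo directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))
```

Under this kind of objective, an echo embedding is pulled towards the visual embedding of the same scene and pushed away from those of other scenes in the batch, so a lower loss corresponds to better cross-modal alignment.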

Track: Track 2: ML Research by Muslim Authors

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Non Archival Confirmation: I understand that submissions to MusIML are non-archival and can be submitted to other venues.

Submission Number: 95
