Keywords: Cross-modal distillation, Computer vision, Binaural audio, Depth prediction
TL;DR: Evaluation of a vision-to-echo distilled binaural depth prediction network on a dataset of real recordings
Abstract: Echo reflections encode physical cues about object distance, geometry, and surface material that are useful for spatial reasoning. Prior work has incorporated echoes as a modality for depth prediction, either through direct fusion or through cross-modal knowledge distillation from vision to audio, but evaluation has been confined to simulated environments such as Replica and Matterport3D, leaving real-world viability untested. In this short paper, we evaluate Visual2Echo Compositional Contrastive Learning (V2E-CCL), a knowledge distillation framework that predicts depth from binaural echoes by aligning cross-modal representations in a shared latent space, on real binaural recordings from the BatVision dataset. To our knowledge, this is the first evaluation of vision-to-echo distillation on real binaural recordings. Our results indicate that the benefit of cross-modal distillation, previously observed only in simulation, also holds on real-world echoes. We further analyse failure modes specific to real echo capture.
Track: Track 2: ML Research by Muslim Authors
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Non Archival Confirmation: I understand that submissions to MusIML are non-archival and can be submitted to other venues.
Submission Number: 95