Highlights
•We introduce a novel multi-modal task, 3D Visual Grounding-Audio (3DVG-Audio), based on the fusion of audio and point cloud data. To the best of our knowledge, this is the first Audio-Point Cloud multi-modal task.
•We construct a new dataset, 3DVG-AudioSet, specifically designed for training and evaluating 3DVG-Audio methods.
•We design a tailored loss function and propose a model named AP-Refer, which serves as a benchmark for 3DVG-Audio.