Highlights
•We introduce a novel multi-modal task, 3D Visual Grounding-Audio (3DVG-Audio), based on the fusion of audio and point cloud data. To the best of our knowledge, this is the first Audio-Point Cloud multi-modal task.
•We construct a new dataset, 3DVG-AudioSet, specifically designed for training and evaluating 3DVG-Audio methods.
•We design a tailored loss function and propose a model named AP-Refer, which serves as a benchmark for 3DVG-Audio.