Keywords: 3D Visual Grounding, Multi-Modal, 3D, 3D Detection, Synthetic 2D generation.
TL;DR: We leverage 2D clues, synthetically generated from 3D point clouds, that empirically show their aptitude to boost the quality of the 3D learned visual representations.
Abstract: 3D visual grounding task has been explored with visual and language streams to comprehend referential language for identifying targeted objects in 3D scenes. However, most existing methods devote the visual stream to capture the 3D visual clues using off-the-shelf point clouds encoders. The main question we address is “can we consolidate the 3D visual stream by 2D clues and efficiently utilize them in both training and testing phases?”. The main idea is to assist the 3D encoder by incorporating rich 2D object representations without requiring extra 2D inputs. To this end, we leverage 2D clues, synthetically generated from 3D point clouds, that empirically show their aptitude to boost the quality of the learned visual representations. We validate our approach through comprehensive experiments on Nr3D, Sr3D, and ScanRefer datasets. Our experiments show consistent performance gains against counterparts, where our proposed module, dubbed as LAR, significantly outperforms state-of-the-art 3D visual grounding techniques on three benchmarks. Our code will be made publicly available.
Supplementary Material: pdf