Abstract: Monocular 3D Visual Grounding (Mono3DVG) aims to predict the 3D localization of an object in a monocular RGB image from a natural language description. The task has broad applications in areas such as autonomous driving, human-computer interaction, and robotic manipulation, making it both significant and challenging. To tackle this challenge, we propose a novel network architecture, the Monocular Selective Attention Learning Network (MSALNet), which enhances object understanding and localization by introducing an Adaptive Learning Module (ALM) and a Vision-Text Interaction Encoder. Specifically, the ALM further refines the features extracted from the 3D scene and the textual description, capturing the contextual relationships within them so that the model better understands both the scene and the text. Meanwhile, the Vision-Text Interaction Encoder performs fine-grained cross-modal interaction and fusion, promoting alignment between visual and textual information and yielding more discriminative feature representations. Experimental results demonstrate that our method achieves competitive performance on the Mono3DRefer dataset.
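To make the cross-modal interaction concrete, the sketch below shows one plausible form such a vision-text interaction step could take: visual tokens cross-attend to text tokens and the result is fused residually. This is a minimal illustration assuming a standard cross-attention design; the class name, dimensions, and layer choices are hypothetical and are not taken from the MSALNet paper.

```python
# Illustrative sketch only: a generic cross-attention vision-text interaction
# block, not the paper's actual Vision-Text Interaction Encoder.
import torch
import torch.nn as nn


class VisionTextInteraction(nn.Module):
    """Cross-attends visual tokens to text tokens and fuses the result."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Visual tokens act as queries over the textual tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, vis_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (B, Nv, dim) image features; txt_tokens: (B, Nt, dim) text features.
        attended, _ = self.cross_attn(vis_tokens, txt_tokens, txt_tokens)
        fused = self.norm1(vis_tokens + attended)       # residual cross-modal fusion
        return self.norm2(fused + self.ffn(fused))      # position-wise refinement


# Example usage with dummy features.
vis = torch.randn(2, 100, 256)   # e.g. flattened image feature-map tokens
txt = torch.randn(2, 20, 256)    # e.g. encoded description tokens
out = VisionTextInteraction()(vis, txt)   # (2, 100, 256) fused visual features
```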