Abstract: Understanding 3D scenes with point cloud data in tasks such as object referencing, question answering, and captioning poses significant challenges to vision language models (VLMs), due to the complexity of integrating both linguistic and spatial information. While existing methods map point cloud features into the LLM space to enable 3D scene comprehension, they often overlook viewpoint information and the relative spatial distance between objects, which can lead to confusion when interpreting spatial descriptions and grounding objects. This paper presents a geometry-enhanced vision language model (GeVLM) to address these challenges. Specifically, we propose viewpoint-consistent position encoding (VCPE) to enhance the representation of relative spatial relationships in a manner agnostic to the camera viewpoint, and a distance-aware cross-entropy (DACE) loss to incorporate distance information into the label space. We additionally introduce the DetailedScanRefer dataset, which provides identifiers and spatial annotations for each object mentioned in a referencing description to further emphasize spatial relationships. GeVLM demonstrates significant improvements over the strong ChatScene baseline, most notably 1.3% Acc@0.25 and 1.0% Acc@0.50 gains in the multiple-object setup, and achieves state-of-the-art overall performance on the ScanRefer dataset\footnote{We have made all the code, model checkpoints, and DetailedScanRefer used in this work available at \url{https://anonymous.4open.science/r/GeVLM-1372/}}.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, cross-modal application, multimodality
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 1155