Abstract: Understanding 3D scenes with point cloud data in tasks such as object referencing, question answering, and captioning poses significant challenges to vision language models (VLMs), due to the complexity of integrating both linguistic and spatial information. While existing methods map point cloud features into the LLM space to enable 3D scene comprehension, they often overlook viewpoint information and the relative spatial distance between objects, which can lead to confusion when interpreting spatial descriptions and grounding objects. This paper presents a geometry-enhanced vision language model (GeVLM) to address these challenges. Specifically, we propose viewpoint-consistent position encoding (VCPE) to enhance the representation of relative spatial relationships in a manner agnostic to the camera viewpoint, and a distance-aware cross-entropy (DACE) loss to incorporate distance information into the label space. We additionally introduce the DetailedScanRefer dataset, which provides identifiers and spatial annotations for each object mentioned in a referencing description to further emphasize spatial relationships. GeVLM demonstrates significant improvements over the strong ChatScene baseline, most notably 1.3% Acc@0.25 and 1.0% Acc@0.50 gains in the multiple-object setup, and achieves state-of-the-art overall performance on the ScanRefer dataset\footnote{We have made all the code, model checkpoints, and DetailedScanRefer used in this work available at \url{https://anonymous.4open.science/r/GeVLM-1372/}}.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, cross-modal application, multimodality
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 1155