Keywords: 3D Visual Grounding, 3D Vision Language, Benchmark
TL;DR: We introduce Anywhere3D-Bench, a holistic 3D visual grounding benchmark consisting of 2.8K referring expression-3D bounding box pairs spanning four different grounding levels: area, space, object, and part.
Abstract: 3D visual grounding has made notable progress in localizing objects within complex 3D scenes. However, grounding referring expressions beyond objects in 3D scenes remains unexplored. In this paper, we introduce Anywhere3D-Bench, a holistic 3D visual grounding benchmark consisting of 2,886 referring expression-3D bounding box pairs spanning four different grounding levels: human-activity areas, unoccupied space beyond objects, individual objects in the scene, and fine-grained object parts. We assess a range of state-of-the-art 3D visual grounding methods alongside large language models (LLMs) and multimodal LLMs (MLLMs) on Anywhere3D-Bench. Experimental results reveal that space-level and part-level visual grounding pose the greatest challenges: space-level tasks require more comprehensive spatial reasoning, for example modeling distances and spatial relations within 3D space, while part-level tasks demand fine-grained perception of object composition. Even the best-performing model, OpenAI o4-mini, achieves only 23.00% accuracy on space-level tasks and 31.46% on part-level tasks, significantly lower than its performance on area-level and object-level tasks. These findings underscore a critical gap in current models’ capacity to understand and reason about 3D scenes beyond object-level semantics.
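Since the benchmark pairs each referring expression with a ground-truth 3D bounding box, a natural way to score a prediction is by 3D box overlap. The following is a minimal sketch of such a scoring scheme, assuming axis-aligned boxes in (center, size) form and an IoU-threshold criterion; the benchmark's actual evaluation protocol (box parameterization, threshold, handling of oriented boxes) is defined by the released code, not by this sketch.

```python
# Hedged sketch: accuracy via axis-aligned 3D IoU (assumed metric, not the
# benchmark's official implementation). Boxes are (cx, cy, cz, dx, dy, dz).
import numpy as np

def box_to_minmax(box):
    """Convert a (center, size) box to per-axis (min, max) corners."""
    c, d = np.asarray(box[:3], dtype=float), np.asarray(box[3:6], dtype=float)
    return c - d / 2.0, c + d / 2.0

def iou_3d(pred, gt):
    """Axis-aligned 3D IoU between two (cx, cy, cz, dx, dy, dz) boxes."""
    pmin, pmax = box_to_minmax(pred)
    gmin, gmax = box_to_minmax(gt)
    # Overlap extent per axis, clipped at zero when the boxes do not intersect.
    inter = np.clip(np.minimum(pmax, gmax) - np.maximum(pmin, gmin), 0.0, None)
    inter_vol = inter.prod()
    union = (pmax - pmin).prod() + (gmax - gmin).prod() - inter_vol
    return inter_vol / union if union > 0 else 0.0

def accuracy(preds, gts, thresh=0.25):
    """Fraction of predicted boxes whose IoU with the ground truth meets thresh."""
    hits = [iou_3d(p, g) >= thresh for p, g in zip(preds, gts)]
    return float(np.mean(hits)) if hits else 0.0
```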
Croissant File: json
Dataset URL: https://huggingface.co/datasets/txwang98/Anywhere3D_v2
Code URL: https://github.com/anywhere-3d/Anywhere3D
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 952
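For readers who want to inspect the data behind the Dataset URL above, the snippet below is a hypothetical loading sketch: it assumes the Hugging Face repo exposes its annotations as standard data files readable by the `datasets` library, and the printed field names are whatever the repo actually provides, not a documented schema.

```python
# Hedged sketch: load and inspect the benchmark annotations from Hugging Face.
# The repo ID is taken from the Dataset URL above; whether load_dataset() works
# without extra arguments depends on how the files are laid out in the repo.
from datasets import load_dataset

ds = load_dataset("txwang98/Anywhere3D_v2")
print(ds)  # available splits and their sizes

first_split = list(ds.keys())[0]
sample = next(iter(ds[first_split]))
print(sample.keys())  # check actual field names before relying on them
```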