Keywords: Large Language Model, object detection, exploration, semantic labeling, Large Vision Model
TL;DR: An iterative method that actively uses an LLM to produce consistent object detections in 3D space.
Abstract: Maintaining semantic label consistency across multiple views is a persistent challenge in 3D semantic object detection. Existing zero-shot approaches that combine 2D detections with vision-language features often suffer from bias toward non-descriptive viewpoints and require a fixed label list to operate. We propose a truly open-vocabulary algorithm that uses large language model (LLM) reasoning to relabel multi-view detections, mitigating errors from poor or ambiguous viewpoints and from occlusions. Our method actively samples informative views based on feature diversity and uncertainty, generates new label hypotheses via LLM reasoning, and recomputes confidences to build a spatial-semantic representation of objects. Experiments on controlled single-object and diverse multi-object scenes show over 40\% improvement in accuracy and sampling rate over ubiquitous fusion methods using YOLO and CLIP. We demonstrate in multiple cases that our LLM-guided Active Detection and Reasoning (LADR) balances detail preservation with reduced ambiguity and a low sampling rate.
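The active-sampling and confidence-recomputation loop described in the abstract can be sketched in simplified form. This is a hypothetical illustration, not the authors' implementation: the entropy-based view selection, the confidence-weighted label fusion, and all function and variable names here are assumptions standing in for the paper's actual feature-diversity criterion and LLM relabeling step.

```python
import math
from collections import defaultdict


def view_uncertainty(label_probs):
    """Shannon entropy of one view's label distribution.

    Higher entropy = more ambiguous view = more informative to resample
    (assumed proxy for the paper's uncertainty criterion).
    """
    return -sum(p * math.log(p) for p in label_probs.values() if p > 0)


def select_next_view(views):
    """Pick the view whose current label distribution is most uncertain."""
    return max(views, key=lambda v: view_uncertainty(views[v]))


def fuse_labels(relabeled_views):
    """Confidence-weighted vote over (label, confidence) pairs.

    Stands in for the recomputed per-object label confidences; an LLM
    relabeling step would supply the pairs in the real pipeline.
    """
    scores = defaultdict(float)
    for label, conf in relabeled_views:
        scores[label] += conf
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


# Toy example: two views of one object with per-view label distributions.
views = {
    "front": {"mug": 0.9, "bowl": 0.1},   # descriptive viewpoint
    "top":   {"mug": 0.4, "bowl": 0.6},   # ambiguous viewpoint
}
ambiguous = select_next_view(views)        # "top" (higher entropy)

# Hypothetical LLM relabeling output for the sampled views.
fused = fuse_labels([("mug", 0.9), ("bowl", 0.6), ("mug", 0.4)])
best = max(fused, key=fused.get)           # "mug"
```

The fused distribution keeps the minority hypothesis ("bowl") with reduced weight rather than discarding it, which loosely mirrors the stated balance between detail preservation and reduced ambiguity.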
Supplementary Material: pdf
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21740