Abstract: Maintaining semantic label consistency across multiple views is a persistent challenge in 3D semantic object detection. Existing zero-shot approaches that combine 2D detections with vision-language features often suffer from bias toward non-descriptive viewpoints and require a fixed label list to operate. We propose a truly open-vocabulary algorithm that uses large language model (LLM) reasoning to relabel multi-view detections, mitigating errors from poor or ambiguous viewpoints and from occlusions. Our method actively samples informative views based on feature diversity and uncertainty, generates new label hypotheses via LLM reasoning, and recomputes confidences to build a spatial-semantic representation of objects. Experiments on controlled single-object and multi-object scenes show double-digit improvements in accuracy and sampling rate over widely used fusion methods based on YOLO and CLIP. We demonstrate in multiple cases that \textbf{L}LM-guided \textbf{A}ctive \textbf{D}etection and \textbf{R}easoning (LADR) balances detail preservation with reduced ambiguity and a low sampling rate. We provide a theoretical convergence analysis showing exponential convergence to a stable and correct semantic label.
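To make the pipeline concrete, here is a minimal Python sketch of the active view-selection and relabeling loop the abstract describes: views are scored by feature diversity weighted by current label uncertainty, each selected view triggers a relabeling step, and per-view evidence is fused into a label posterior. All names and numbers here (select_view, llm_relabel, the 3-hypothesis posterior, the 0.2 entropy threshold) are illustrative assumptions, not the paper's implementation, and the LLM call is stubbed with a synthetic likelihood so the snippet runs standalone.

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    """Shannon entropy of the current label posterior (uncertainty)."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def llm_relabel(view_id):
    # Hypothetical stand-in for LLM reasoning over a 2D detection crop.
    # A real system would prompt an LLM with the detection and viewpoint
    # context; here we draw a noisy likelihood peaked on the true label 0.
    noise = rng.dirichlet(np.ones(3))
    peaked = np.array([0.7, 0.2, 0.1])
    return 0.8 * peaked + 0.2 * noise

def select_view(sampled_feats, candidate_feats, label_entropy):
    # Diversity: each candidate's distance to its nearest already-sampled
    # view (features are unit-norm, so the dot product is cosine similarity).
    sims = candidate_feats @ np.stack(sampled_feats).T
    diversity = 1.0 - sims.max(axis=1)
    # Uncertainty scales the incentive to keep sampling at all.
    return int(np.argmax(diversity * label_entropy))

# Toy setup: 8 candidate viewpoints with unit-norm CLIP-like features.
feats = rng.normal(size=(8, 16))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)

posterior = np.ones(3) / 3   # uniform over 3 label hypotheses
sampled = [0]                # start from an arbitrary first view
remaining = list(range(1, 8))

while remaining and entropy(posterior) > 0.2:
    idx = select_view([feats[i] for i in sampled], feats[remaining],
                      entropy(posterior))
    view = remaining.pop(idx)
    sampled.append(view)
    posterior *= llm_relabel(view)   # fuse per-view label evidence
    posterior /= posterior.sum()     # recompute confidences

print("views sampled:", sampled, "final posterior:", np.round(posterior, 3))
```

Under this reading, multiplicative fusion of per-view likelihoods that are consistently peaked on one label drives the posterior toward that label at a geometric rate, which is consistent with the exponential-convergence claim in the abstract.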
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~C.V._Jawahar1
Submission Number: 6940