3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding

Published: 23 Feb 2026, Last Modified: 08 May 2026. Published at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026. License: CC BY 4.0.
Abstract: Large multimodal models are increasingly used as the reasoning core of embodied agents operating in 3D environments, yet they remain prone to hallucinations that can produce unsafe and ungrounded decisions. Existing inference-time hallucination mitigation methods largely target 2D vision-language settings and do not transfer to embodied 3D reasoning, where failures arise from object presence, spatial layout, and geometric grounding rather than pixel-level inconsistencies. We introduce 3D-VCD, the first inference-time visual contrastive decoding framework for hallucination mitigation in 3D embodied agents. 3D-VCD constructs a distorted 3D scene graph by applying semantic and geometric perturbations to object-centric representations, such as category substitutions and coordinate or extent corruption. By contrasting predictions under the original and distorted 3D contexts, our method suppresses tokens that are insensitive to grounded scene evidence and are therefore likely driven by language priors. We evaluate 3D-VCD on the 3D-POPE and HEAL benchmarks and show that it consistently improves grounded reasoning without any retraining, establishing inference-time contrastive decoding over structured 3D representations as an effective and practical route to more reliable embodied intelligence. Project Link: PLAN Lab https://plan-lab.github.io/3d-vcd
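To make the contrastive decoding step concrete, the sketch below illustrates the general visual-contrastive-decoding recipe the abstract describes: next-token logits are computed once conditioned on the original 3D scene graph and once on the perturbed graph, and tokens that remain likely even under the corrupted scene are penalized. This is a minimal illustration, not the paper's implementation; the function name, the `alpha`/`beta` parameters, and the exact combination rule are assumptions for exposition.

```python
import torch
import torch.nn.functional as F

def contrastive_decode_step(logits_original: torch.Tensor,
                            logits_distorted: torch.Tensor,
                            alpha: float = 1.0,
                            beta: float = 0.1) -> int:
    """One greedy decoding step with scene-graph contrastive decoding (sketch).

    logits_original:  next-token logits conditioned on the true 3D scene graph.
    logits_distorted: next-token logits conditioned on the perturbed scene graph
                      (e.g., category substitutions, corrupted coordinates/extents).
    alpha: contrast strength (hypothetical default).
    beta:  plausibility cutoff relative to the most likely original token.
    """
    # Amplify tokens whose probability depends on grounded scene evidence and
    # down-weight tokens the model still predicts under a corrupted scene,
    # i.e., tokens likely driven by language priors.
    contrastive_logits = (1.0 + alpha) * logits_original - alpha * logits_distorted

    # Adaptive plausibility constraint: keep only tokens that are reasonably
    # probable under the original context, so the contrast cannot promote
    # otherwise implausible tokens.
    probs_original = F.softmax(logits_original, dim=-1)
    cutoff = beta * probs_original.max()
    contrastive_logits[probs_original < cutoff] = float("-inf")

    return int(torch.argmax(contrastive_logits).item())
```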