Abstract: Large multimodal models are increasingly used as the reasoning core of embodied agents operating in 3D environments, yet they remain prone to hallucinations that can produce unsafe and ungrounded decisions. Existing inference-time hallucination mitigation methods largely target 2D vision-language settings and do not transfer to embodied 3D reasoning, where failures arise from object presence, spatial layout, and geometric grounding rather than pixel-level inconsistencies. We introduce 3D-VCD, the first inference-time visual contrastive decoding framework for hallucination mitigation in 3D embodied agents. 3D-VCD constructs a distorted 3D scene graph by applying semantic and geometric perturbations to object-centric representations, such as category substitutions and coordinate or extent corruption. By contrasting predictions under the original and distorted 3D contexts, our method suppresses tokens that are insensitive to grounded scene evidence and are therefore likely driven by language priors. We evaluate 3D-VCD on the 3D-POPE and HEAL benchmarks and show that it consistently improves grounded reasoning without any retraining, establishing inference-time contrastive decoding over structured 3D representations as an effective and practical route to more reliable embodied intelligence.
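The contrast described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the adjustment rule `(1 + α)·logits_orig − α·logits_distorted` is the standard visual contrastive decoding formulation, assumed here to carry over to the 3D setting, and all function names, the `alpha` parameter, and the scene-object dictionary layout are hypothetical.

```python
import numpy as np

def contrastive_decode_step(logits_orig, logits_dist, alpha=1.0):
    """One decoding step of contrastive adjustment (sketch).

    Tokens whose logits are similar under the original and distorted
    3D contexts (i.e., insensitive to grounded scene evidence) are
    suppressed; context-sensitive tokens are amplified.
    """
    adjusted = (1 + alpha) * logits_orig - alpha * logits_dist
    # Numerically stable softmax over the adjusted logits.
    z = adjusted - adjusted.max()
    p = np.exp(z)
    return p / p.sum()

def perturb_object(obj, rng, vocab=("chair", "table", "sofa"), noise=0.5):
    """Semantic + geometric perturbation of one object-centric record
    (hypothetical schema: {"category", "center", "extent"})."""
    out = dict(obj)
    # Category substitution: swap in a different label from the vocabulary.
    alternatives = [c for c in vocab if c != obj["category"]]
    out["category"] = alternatives[rng.integers(len(alternatives))]
    # Coordinate and extent corruption with Gaussian noise.
    out["center"] = [c + rng.normal(0.0, noise) for c in obj["center"]]
    out["extent"] = [e * (1.0 + rng.normal(0.0, noise)) for e in obj["extent"]]
    return out
```

For instance, a token with equal logits in both contexts loses out to a token whose logit drops sharply under the distorted scene graph, even if the latter's original logit was lower.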
Project Link: https://plan-lab.github.io/3d-vcd (PLAN Lab)