Abstract: Large Vision-Language Models (LVLMs) exhibit sophisticated reasoning but remain susceptible to object hallucination. Departing from the prevailing *attention intensity assumption*, we reveal a deeper, dynamic structural misalignment: hallucination is triggered at decision-critical steps where specific attention heads, acting as risky mediators, decouple from visual evidence and lock onto language priors. This establishes a pathological shortcut that bypasses visual grounding. To dismantle this shortcut, we propose **Fox** (*F*aithfulness and *O*bservational-flow via e*X*pression-rectification), a training-free inference-time framework. **Fox** diagnoses structural misalignment with a visual attention entropy probe that localizes risky mediators without supervision. It then executes a targeted causal intervention via numerical logit saturation to sever the shortcut path. Finally, a conflict-gated cooperative decoding strategy reconciles interventional faithfulness with observational fluency. Extensive experiments demonstrate that **Fox** achieves state-of-the-art performance, outperforming SID by $29.1\%$ while preserving linguistic richness. Code is available at <https://github.com/Cc2021start/Fox>.
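To make the idea of a visual attention entropy probe concrete, the following is a minimal sketch, not the authors' implementation. It assumes the probe measures, for each attention head, the Shannon entropy of the attention the current query token places on image tokens, after renormalizing over those tokens. The function name `visual_attention_entropy`, the array layout, and the toy values are all hypothetical; how Fox turns such scores into a set of risky mediators is not specified here and is left out.

```python
import numpy as np

def visual_attention_entropy(attn, visual_idx, eps=1e-12):
    """Per-head entropy of attention over visual tokens (illustrative only).

    attn:       (num_heads, seq_len) attention weights for one query token.
    visual_idx: positions of the image tokens in the sequence (assumed known).
    Returns a (num_heads,) array of Shannon entropies in nats.
    """
    vis = attn[:, visual_idx]
    # Renormalize so each head's visual attention sums to 1.
    vis = vis / (vis.sum(axis=1, keepdims=True) + eps)
    return -(vis * np.log(vis + eps)).sum(axis=1)

# Toy example: 2 heads, 4 visual tokens followed by 2 text tokens.
attn = np.array([
    [0.25, 0.25, 0.25, 0.25, 0.0, 0.0],  # spread evenly over the image
    [0.70, 0.00, 0.00, 0.00, 0.2, 0.1],  # visual mass on a single token
])
visual_idx = [0, 1, 2, 3]
entropy = visual_attention_entropy(attn, visual_idx)
```

In this toy case the first head attains the maximum entropy for four visual tokens, `log(4)`, while the second head's entropy is close to zero. Which end of that range marks a head as risky is a design choice of the method and is not assumed in this sketch.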
Lay Summary: AI systems that answer questions about images can sometimes describe things that are not actually there, such as inventing objects or attributes. This makes them harder to trust in real-world uses where visual accuracy matters. Our work studies why these mistakes happen and finds that the model is not simply “looking too little” at the image. Instead, some parts of the model can rely too strongly on learned language habits, especially when deciding what to say next.
We propose Fox, a method that detects these risky parts during generation and reduces their influence, without retraining the model. The method also keeps the model’s ability to produce detailed and natural answers, so it does not become overly cautious. In tests across several image-language AI systems, Fox reduces false visual claims while keeping answers useful and fluent, with little extra cost. This work offers a practical step toward more reliable AI systems that can describe and reason about visual content more faithfully.
Primary Area: Social Aspects->Safety
Keywords: Hallucination mitigation, LVLMs, causal mechanism
Submission Number: 3458