Adversarial Object Hallucination Attacks in Video-Language Models via Intermediate Feature Alignment
Keywords: Video Large Language Models, Adversarial Attack, Object Hallucination
Abstract: Video Large Language Models (Vid-LLMs) have rapidly advanced video understanding, yet their robustness against semantic adversarial manipulation, especially object hallucination, remains largely unexplored. We introduce Adversarial Object Hallucination (AOH), a novel attack that compels Vid-LLMs to ``see'' non-existent objects in videos by injecting visually imperceptible perturbations. Unlike prior attacks that operate only on the video input or the model output, AOH directly manipulates intermediate connector features, aligning them with the representations of a target video to induce controllable hallucinations. To systematically assess this threat, we curate a benchmark of 535 clean/target video pairs with high-quality VQA annotations. Extensive experiments show that AOH mounts highly effective attacks against state-of-the-art Vid-LLMs and exhibits alarming \emph{cross-scale transferability}: adversarial examples optimized on smaller models transfer even more strongly to larger counterparts of the same architecture, amplifying attack impact while reducing adversarial cost. Further analyses reveal that the perturbations encode semantic object contours, and Grad-CAM visualizations highlight their covert influence. These findings expose a severe and previously overlooked vulnerability in Vid-LLMs, raising urgent concerns about their secure deployment and providing a foundation for future adversarial research in video-language modeling.
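To make the intermediate-feature-alignment idea concrete, below is a minimal sketch of how such an attack could be optimized. It assumes a hypothetical `connector_features` accessor that exposes the visual-connector representations a Vid-LLM feeds to its language model, and it uses an MSE alignment loss with a PGD-style L-infinity update; the abstract does not specify the actual objective, optimizer, or budget, so these are illustrative assumptions rather than the authors' exact method.

```python
import torch
import torch.nn.functional as F

def feature_alignment_attack(model, clean_video, target_video,
                             eps=8 / 255, alpha=1 / 255, steps=500):
    """Sketch: optimize an imperceptible perturbation on the clean video so
    that its intermediate connector features match those of a target video
    containing the hallucinated object.

    `model.connector_features` is a hypothetical, model-specific accessor
    for the connector output; eps/alpha/steps are assumed hyperparameters.
    """
    with torch.no_grad():
        # Features of the target video that the attack tries to imitate.
        target_feat = model.connector_features(target_video)

    delta = torch.zeros_like(clean_video, requires_grad=True)
    for _ in range(steps):
        adv_feat = model.connector_features(clean_video + delta)
        # Align the perturbed video's intermediate features with the target's.
        loss = F.mse_loss(adv_feat, target_feat)
        loss.backward()
        with torch.no_grad():
            # Signed-gradient descent step, then project back into the
            # L-infinity ball and keep pixel values in a valid range.
            delta -= alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
            delta.copy_((clean_video + delta).clamp(0, 1) - clean_video)
        delta.grad.zero_()

    return (clean_video + delta).detach()
```

Because the perturbation is optimized against intermediate features rather than generated text, the same adversarial video can, in principle, steer any downstream question-answering about the clip, which is consistent with the controllable-hallucination behavior described in the abstract.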
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19729