Grounded Retrieval Generation Framework for VideoLLM Hallucination Mitigation

Published: 02 Oct 2025, Last Modified: 10 Oct 2025, Venue: RIWM Non Archival, License: CC BY 4.0
Keywords: VideoLLMs, Hallucination, GRAVITI, Grounded Retrieval Generation
TL;DR: GRAVITI mitigates hallucinations in video-language models by grounding generation with retrieval from video features and metadata, improving accuracy and reliability across benchmarks.
Abstract: Video-language models (VideoLLMs) excel at tasks such as video captioning and question answering, but often produce hallucinations—content not grounded in the video or metadata—limiting their reliability. To address this, we propose GRAVITI (Grounded Retrieval Generation framework for VideoLLM hallucination mitigation), a model-agnostic, training-free and API-free framework that integrates a dynamically constructed ad-hoc knowledge base with a retrieval-guided decoding process. We refer to this process as Grounded Retrieval Generation (GRG), where each generated token is conditioned on evidence retrieved from video features and auxiliary metadata. GRAVITI reduces hallucinations while remaining compatible across diverse VideoLLMs. Evaluated on three benchmarks—VidHalluc, EventHallusion, and VideoHallucer—GRAVITI improves overall accuracy by 6–14% and substantially lowers hallucination rates compared to strong baselines. Ablation studies demonstrate the impact of retrieval size, detector thresholds, and grounding mechanisms, highlighting the effectiveness of GRG in producing reliable, multi-modal video descriptions.
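The abstract describes GRG only at a high level, so the following is a toy sketch of what retrieval-guided decoding could look like, not the authors' implementation. It assumes a small in-memory knowledge base of text snippets with embedding vectors, cosine-similarity retrieval, and an additive logit bonus for tokens supported by retrieved evidence. All names (`retrieve`, `grounded_step`, `bonus`) and all data values are hypothetical and chosen for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, knowledge_base, k=2):
    """Return the k knowledge-base entries most similar to the query (assumed retrieval rule)."""
    ranked = sorted(
        knowledge_base,
        key=lambda entry: cosine(query_vec, entry["vec"]),
        reverse=True,
    )
    return ranked[:k]

def grounded_step(logits, evidence, bonus=2.0):
    """Choose the next token after boosting candidates that appear in the evidence.

    The additive bonus is an illustrative assumption, not the paper's scoring rule.
    """
    supported = {tok for entry in evidence for tok in entry["text"].split()}
    adjusted = {
        tok: score + (bonus if tok in supported else 0.0)
        for tok, score in logits.items()
    }
    return max(adjusted, key=adjusted.get)

# Hypothetical ad-hoc knowledge base built from video features and metadata.
kb = [
    {"text": "a dog runs on grass", "vec": [1.0, 0.0, 0.2]},
    {"text": "audio track contains barking", "vec": [0.8, 0.1, 0.3]},
    {"text": "a cat sleeps indoors", "vec": [0.0, 1.0, 0.1]},
]
query = [1.0, 0.0, 0.25]                         # made-up embedding of the current context
logits = {"cat": 1.5, "dog": 1.0, "bird": 0.2}   # made-up next-token scores

evidence = retrieve(query, kb, k=2)
next_token = grounded_step(logits, evidence)
```

In this toy setup the ungrounded scores favor "cat", while the retrieved evidence shifts the choice to "dog". A real system would apply such an adjustment at every decoding step over the model's full vocabulary.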
Submission Number: 8