Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models
Keywords: Video hallucination, Large Multimodal Model
TL;DR: Video hallucination in multimodal model attributes to the improper internal world state.
Abstract: While multimodal large language models (MLLMs) have advanced video understanding, they remain highly prone to hallucinations in dynamic scenes. We argue this stems from a failure in spatio-temporal monitoring, the ability to persistently track object identities, states, and relations over time. Existing benchmarks obscure this deficit by relying on single final-answer evaluations for queries that can often be resolved via local visual cues or statistical priors. To rigorously diagnose this, we introduce $\textbf{STEMO-Bench}$ ($\textbf{S}$patio-$\textbf{TE}$mporal $\textbf{MO}$nitoring), a benchmark of human-verified object-centric facts that evaluates intermediate reasoning by decomposing queries into sub-questions, distinguishing genuine temporal understanding from coincidental correctness. To address failure modes exposed by STEMO, we propose $\textbf{STEMO-Track}$, a novel object-centric framework that explicitly constructs and reasons over structured object trajectories via chunk-wise state extraction and temporal aggregation. Extensive experiments demonstrate that our object-centric framework significantly reduces hallucinated answers and improves spatio-temporal reasoning consistency over state-of-the-art MLLMs.
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 240
Loading